* Merge of HSA branch
@ 2015-11-05 21:51 Martin Jambor
  2015-11-05 21:53 ` [hsa 1/12] Configuration and offloading-related changes Martin Jambor
                   ` (12 more replies)
  0 siblings, 13 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:51 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Richard Biener, Martin Liska, Michael Matz

Hi,

we have had a few last-minute issues, so I apologize that this is
happening a bit late, but I would like to start the process of merging
the HSA branch to GCC trunk.  Changes to different parts of the
compiler and libgomp are posted as individual emails in this thread.
No two patches touch the same file, and almost all of them need to be
applied together at once; I have only split the one big patch up to
ease review.

Individual changes are described in slightly more detail in their
respective messages.  If you are interested in how the HSAIL
generation works in general, I encourage you to have a look at my
Cauldron slides or presentation; only very few things have changed as
far as the general principles are concerned.  Let me just quickly
stress here that we do acceleration within a single compiler, as
opposed to the LTO-based approaches of all the other accelerator teams.

We have bootstrapped and tested the patched GCC with and without HSA
enabled on x86_64-linux and found only a single issue (see below) and a
few new HSA warnings when HSA was enabled.  We have also run the full
C, C++ and Fortran testsuites on a computer that actually has a Carrizo
HSA APU and the result was the same.

The single issue is that the libgomp.oacc-c/../libgomp.oacc-c-c++-common/*
tests get run using our HSA plugin, which however cannot do OpenACC.
I believe this should be very easy to fix, and I did not want to hold
off the submission and review because of it.

I also acknowledge that we should add HSA-specific tests to the GCC
testsuite but we are only now looking at how to do that and will
welcome any guidance in this regard.

I acknowledge that the submission comes quite late and that the class
of OpenMP loops we can handle well is small, but nevertheless I would
like to ask for review and eventual acceptance to trunk and GCC 6.

I'll be grateful for any review, comments or questions,

Martin


* [hsa 1/12] Configuration and offloading-related changes
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
@ 2015-11-05 21:53 ` Martin Jambor
  2015-11-05 22:47   ` Joseph Myers
  2015-11-05 21:54 ` [hsa 2/12] Modifications to libgomp proper Martin Jambor
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:53 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek

Hi,

this patch contains changes to the configuration mechanism and offload
bits, so that users can build compilers with HSA support and so that
HSA plays nicely with the other accelerators despite using an
altogether different implementation approach.

With this patch, the user can request HSA support by including the
string "hsa" among the requested accelerators in
--enable-offload-targets.  This will cause the compiler to start
producing HSAIL for target OpenMP regions/functions and the HSA
libgomp plugin to be built.  Because the plugin needs to use the HSA
run-time library, I have introduced the option --with-hsa-runtime (and
the more precise --with-hsa-runtime-include and --with-hsa-runtime-lib)
to help find it.
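
For illustration, a configure invocation requesting HSA support might
look like this (the HSA run-time installation prefix is only an
example):

  $ .../configure --enable-offload-targets=hsa \
      --with-hsa-runtime=/opt/hsa [other options]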

The catch, however, is that there is no offload compiler for HSA, so
the wrapper should not attempt to look for one (that is what the hunk
in lto-wrapper.c does).  Moreover, when HSA is the only configured
accelerator, it is wasteful to output LTO sections with byte-code, and
therefore configure does not define the ENABLE_OFFLOADING macro in
that case.

Finally, when the compiler has been configured for HSA but the user
disables it by omitting it in the -foffload compiler option, we need
to observe that decision.  That is what the opts.c hunk does.
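
For example, assuming a compiler configured with
--enable-offload-targets=hsa,nvptx-none and a hypothetical test.c, the
invocation

  $ gcc -fopenmp -foffload=nvptx-none test.c

omits "hsa" from -foffload and therefore disables HSAIL generation,
while -foffload=disable suppresses offloading altogether.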

Thanks,

Martin


2015-11-04  Martin Jambor  <mjambor@suse.cz>

gcc/
	* Makefile.in (OBJS): Add new source files.
	(GTFILES): Add hsa.c.
	* configure.ac (accel_dir_suffix): Treat hsa specially.
	(OFFLOAD_TARGETS): Define ENABLE_OFFLOADING according to
	$enable_offloading.
	(ENABLE_HSA): Define ENABLE_HSA according to $enable_hsa.
	* lto-wrapper.c (compile_images_for_offload_targets): Do not attempt
	to invoke the offload compiler for the hsa accelerator.
	* opts.c (common_handle_option): Determine whether HSA offloading
	should be performed.

libgomp/plugin/
	* Makefrag.am: Add HSA plugin requirements.
	* configfrag.ac (HSA_RUNTIME_INCLUDE): New variable.
	(HSA_RUNTIME_LIB): Likewise.
	(HSA_RUNTIME_CPPFLAGS): Likewise.
	(HSA_RUNTIME_INCLUDE): New substitution.
	(HSA_RUNTIME_LIB): Likewise.
	(HSA_RUNTIME_LDFLAGS): Likewise.
	(hsa-runtime): New configure option.
	(hsa-runtime-include): Likewise.
	(hsa-runtime-lib): Likewise.
	(PLUGIN_HSA): New substitution variable.
	Fill HSA_RUNTIME_INCLUDE and HSA_RUNTIME_LIB according to the new
	configure options.
	(PLUGIN_HSA_CPPFLAGS): Likewise.
	(PLUGIN_HSA_LDFLAGS): Likewise.
	(PLUGIN_HSA_LIBS): Likewise.
	Check that we have access to HSA run-time.
	(PLUGIN_HSA): New conditional.


diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 7d53a7d..2ca16b1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1293,6 +1293,11 @@ OBJS = \
 	graphite-sese-to-poly.o \
 	gtype-desc.o \
 	haifa-sched.o \
+	hsa.o \
+	hsa-gen.o \
+	hsa-regalloc.o \
+	hsa-brig.o \
+	hsa-dump.o \
 	hw-doloop.o \
 	hwint.o \
 	ifcvt.o \
@@ -1317,6 +1322,7 @@ OBJS = \
 	ipa-icf.o \
 	ipa-icf-gimple.o \
 	ipa-reference.o \
+	ipa-hsa.o \
 	ipa-ref.o \
 	ipa-utils.o \
 	ipa.o \
@@ -2379,6 +2385,7 @@ GTFILES = $(CPP_ID_DATA_H) $(srcdir)/input.h $(srcdir)/coretypes.h \
   $(srcdir)/sanopt.c \
   $(srcdir)/ipa-devirt.c \
   $(srcdir)/internal-fn.h \
+  $(srcdir)/hsa.c \
   @all_gtfiles@
 
 # Compute the list of GT header files from the corresponding C sources,
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 7e22267..d0d4565 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -943,6 +943,13 @@ AC_SUBST(accel_dir_suffix)
 
 for tgt in `echo $enable_offload_targets | sed 's/,/ /g'`; do
   tgt=`echo $tgt | sed 's/=.*//'`
+
+  if echo "$tgt" | grep "^hsa" > /dev/null ; then
+    enable_hsa=1
+  else
+    enable_offloading=1
+  fi
+
   if test x"$offload_targets" = x; then
     offload_targets=$tgt
   else
@@ -951,11 +958,16 @@ for tgt in `echo $enable_offload_targets | sed 's/,/ /g'`; do
 done
 AC_DEFINE_UNQUOTED(OFFLOAD_TARGETS, "$offload_targets",
   [Define to offload targets, separated by commas.])
-if test x"$offload_targets" != x; then
+if test x"$enable_offloading" != x; then
   AC_DEFINE(ENABLE_OFFLOADING, 1,
     [Define this to enable support for offloading.])
 fi
 
+if test x"$enable_hsa" = x1 ; then
+  AC_DEFINE(ENABLE_HSA, 1,
+    [Define this to enable support for generating HSAIL.])
+fi
+
 AC_ARG_WITH(multilib-list,
 [AS_HELP_STRING([--with-multilib-list], [select multilibs (AArch64, SH and x86-64 only)])],
 :,
diff --git a/gcc/lto-wrapper.c b/gcc/lto-wrapper.c
index 20e67ed..5f564d9 100644
--- a/gcc/lto-wrapper.c
+++ b/gcc/lto-wrapper.c
@@ -745,6 +745,11 @@ compile_images_for_offload_targets (unsigned in_argc, char *in_argv[],
   offload_names = XCNEWVEC (char *, num_targets + 1);
   for (unsigned i = 0; i < num_targets; i++)
     {
+      /* HSA uses a different compiler and does not use LTO-like streaming;
+	 skip it.  */
+      if (strncmp (names[i], "hsa", 3) == 0)
+	continue;
+
       offload_names[i]
 	= compile_offload_image (names[i], compiler_path, in_argc, in_argv,
 				 compiler_opts, compiler_opt_count,
diff --git a/gcc/opts.c b/gcc/opts.c
index 9a3fbb3..5c9aca9 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -1894,8 +1894,35 @@ common_handle_option (struct gcc_options *opts,
       break;
 
     case OPT_foffload_:
-      /* Deferred.  */
-      break;
+      {
+	const char *p = arg;
+	opts->x_flag_disable_hsa = true;
+	while (*p != 0)
+	  {
+	    const char *comma = strchr (p, ',');
+
+	    if ((strncmp (p, "disable", 7) == 0)
+		&& (p[7] == ',' || p[7] == '\0'))
+	      {
+		opts->x_flag_disable_hsa = true;
+		break;
+	      }
+
+	    if ((strncmp (p, "hsa", 3) == 0)
+		&& (p[3] == ',' || p[3] == '\0'))
+	      {
+#ifdef ENABLE_HSA
+		opts->x_flag_disable_hsa = false;
+#else
+		sorry ("HSA has not been enabled during configuration");
+#endif
+	      }
+	    if (!comma)
+	      break;
+	    p = comma + 1;
+	  }
+	break;
+      }
 
 #ifndef ACCEL_COMPILER
     case OPT_foffload_abi_:
diff --git a/libgomp/plugin/Makefrag.am b/libgomp/plugin/Makefrag.am
index 745becd..433bba1 100644
--- a/libgomp/plugin/Makefrag.am
+++ b/libgomp/plugin/Makefrag.am
@@ -38,3 +38,16 @@ libgomp_plugin_nvptx_la_LDFLAGS += $(PLUGIN_NVPTX_LDFLAGS)
 libgomp_plugin_nvptx_la_LIBADD = libgomp.la $(PLUGIN_NVPTX_LIBS)
 libgomp_plugin_nvptx_la_LIBTOOLFLAGS = --tag=disable-static
 endif
+
+if PLUGIN_HSA
+# Heterogeneous System Architecture plugin
+libgomp_plugin_hsa_version_info = -version-info $(libtool_VERSION)
+toolexeclib_LTLIBRARIES += libgomp-plugin-hsa.la
+libgomp_plugin_hsa_la_SOURCES = plugin/plugin-hsa.c
+libgomp_plugin_hsa_la_CPPFLAGS = $(AM_CPPFLAGS) $(PLUGIN_HSA_CPPFLAGS)
+libgomp_plugin_hsa_la_LDFLAGS = $(libgomp_plugin_hsa_version_info) \
+	$(lt_host_flags)
+libgomp_plugin_hsa_la_LDFLAGS += $(PLUGIN_HSA_LDFLAGS)
+libgomp_plugin_hsa_la_LIBADD = libgomp.la $(PLUGIN_HSA_LIBS)
+libgomp_plugin_hsa_la_LIBTOOLFLAGS = --tag=disable-static
+endif
diff --git a/libgomp/plugin/configfrag.ac b/libgomp/plugin/configfrag.ac
index ad70dd1..c50e5cb 100644
--- a/libgomp/plugin/configfrag.ac
+++ b/libgomp/plugin/configfrag.ac
@@ -81,6 +81,54 @@ AC_SUBST(PLUGIN_NVPTX_CPPFLAGS)
 AC_SUBST(PLUGIN_NVPTX_LDFLAGS)
 AC_SUBST(PLUGIN_NVPTX_LIBS)
 
+# Look for HSA run-time, its includes and libraries
+
+HSA_RUNTIME_INCLUDE=
+HSA_RUNTIME_LIB=
+AC_SUBST(HSA_RUNTIME_INCLUDE)
+AC_SUBST(HSA_RUNTIME_LIB)
+HSA_RUNTIME_CPPFLAGS=
+HSA_RUNTIME_LDFLAGS=
+
+AC_ARG_WITH(hsa-runtime,
+	[AS_HELP_STRING([--with-hsa-runtime=PATH],
+		[specify prefix directory for installed HSA run-time package.
+		 Equivalent to --with-hsa-runtime-include=PATH/include
+		 plus --with-hsa-runtime-lib=PATH/lib])])
+AC_ARG_WITH(hsa-runtime-include,
+	[AS_HELP_STRING([--with-hsa-runtime-include=PATH],
+		[specify directory for installed HSA run-time include files])])
+AC_ARG_WITH(hsa-runtime-lib,
+	[AS_HELP_STRING([--with-hsa-runtime-lib=PATH],
+		[specify directory for the installed HSA run-time library])])
+if test "x$with_hsa_runtime" != x; then
+  HSA_RUNTIME_INCLUDE=$with_hsa_runtime/include
+  HSA_RUNTIME_LIB=$with_hsa_runtime/lib
+fi
+if test "x$with_hsa_runtime_include" != x; then
+  HSA_RUNTIME_INCLUDE=$with_hsa_runtime_include
+fi
+if test "x$with_hsa_runtime_lib" != x; then
+  HSA_RUNTIME_LIB=$with_hsa_runtime_lib
+fi
+if test "x$HSA_RUNTIME_INCLUDE" != x; then
+  HSA_RUNTIME_CPPFLAGS=-I$HSA_RUNTIME_INCLUDE
+fi
+if test "x$HSA_RUNTIME_LIB" != x; then
+  HSA_RUNTIME_LDFLAGS=-L$HSA_RUNTIME_LIB
+fi
+
+PLUGIN_HSA=0
+PLUGIN_HSA_CPPFLAGS=
+PLUGIN_HSA_LDFLAGS=
+PLUGIN_HSA_LIBS=
+AC_SUBST(PLUGIN_HSA)
+AC_SUBST(PLUGIN_HSA_CPPFLAGS)
+AC_SUBST(PLUGIN_HSA_LDFLAGS)
+AC_SUBST(PLUGIN_HSA_LIBS)
+
+
+
 # Get offload targets and path to install tree of offloading compiler.
 offload_additional_options=
 offload_additional_lib_paths=
@@ -122,6 +170,49 @@ if test x"$enable_offload_targets" != x; then
 	    ;;
 	esac
 	;;
+      hsa*)
+	case "${target}" in
+	  x86_64-*-*)
+	    case " ${CC} ${CFLAGS} " in
+	      *" -m32 "*)
+	        PLUGIN_HSA=0
+		;;
+	      *)
+	        tgt_name=hsa
+	        PLUGIN_HSA=$tgt
+	        PLUGIN_HSA_CPPFLAGS=$HSA_RUNTIME_CPPFLAGS
+	        PLUGIN_HSA_LDFLAGS=$HSA_RUNTIME_LDFLAGS
+	        PLUGIN_HSA_LIBS="-lhsa-runtime64 -lhsakmt"
+
+	        PLUGIN_HSA_save_CPPFLAGS=$CPPFLAGS
+	        CPPFLAGS="$PLUGIN_HSA_CPPFLAGS $CPPFLAGS"
+	        PLUGIN_HSA_save_LDFLAGS=$LDFLAGS
+	        LDFLAGS="$PLUGIN_HSA_LDFLAGS $LDFLAGS"
+	        PLUGIN_HSA_save_LIBS=$LIBS
+	        LIBS="$PLUGIN_HSA_LIBS $LIBS"
+
+	        AC_LINK_IFELSE(
+	          [AC_LANG_PROGRAM(
+	            [#include "hsa.h"],
+	              [hsa_status_t status = hsa_init ()])],
+	          [PLUGIN_HSA=1])
+	        CPPFLAGS=$PLUGIN_HSA_save_CPPFLAGS
+	        LDFLAGS=$PLUGIN_HSA_save_LDFLAGS
+	        LIBS=$PLUGIN_HSA_save_LIBS
+	        case $PLUGIN_HSA in
+	          hsa*)
+	            PLUGIN_HSA=0
+	            AC_MSG_ERROR([HSA run-time package required for HSA support])
+	            ;;
+	        esac
+		;;
+	      esac
+    	    ;;
+	  *-*-*)
+	    PLUGIN_HSA=0
+            ;;
+        esac
+        ;;
       *)
 	AC_MSG_ERROR([unknown offload target specified])
 	;;
@@ -145,3 +236,6 @@ AC_DEFINE_UNQUOTED(OFFLOAD_TARGETS, "$offload_targets",
 AM_CONDITIONAL([PLUGIN_NVPTX], [test $PLUGIN_NVPTX = 1])
 AC_DEFINE_UNQUOTED([PLUGIN_NVPTX], [$PLUGIN_NVPTX],
   [Define to 1 if the NVIDIA plugin is built, 0 if not.])
+AM_CONDITIONAL([PLUGIN_HSA], [test $PLUGIN_HSA = 1])
+AC_DEFINE_UNQUOTED([PLUGIN_HSA], [$PLUGIN_HSA],
+  [Define to 1 if the HSA plugin is built, 0 if not.])
diff -rup src/gcc/config.in hsa/gcc/config.in
--- src/gcc/config.in	2015-10-27 17:58:04.641572991 +0100
+++ hsa/gcc/config.in	2015-10-29 17:24:42.119799407 +0100
@@ -151,6 +151,12 @@
 #endif
 
 
+/* Define this to enable support for generating HSAIL. */
+#ifndef USED_FOR_TARGET
+#undef ENABLE_HSA
+#endif
+
+
 /* Define if gcc should always pass --build-id to linker. */
 #ifndef USED_FOR_TARGET
 #undef ENABLE_LD_BUILDID


* [hsa 2/12] Modifications to libgomp proper
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
  2015-11-05 21:53 ` [hsa 1/12] Configuration and offloading-related changes Martin Jambor
@ 2015-11-05 21:54 ` Martin Jambor
  2015-11-12 10:11   ` Jakub Jelinek
  2015-11-05 21:56 ` [hsa 3/12] HSA libgomp plugin Martin Jambor
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:54 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek

Hi,

The patch below contains all changes to libgomp files.  First, it adds
a new constant identifying HSA devices and a structure that is shared
between libgomp and the compiler when kernels are invoked from other
kernels via dynamic parallelism.

Second, it modifies the GOMP_target_41 function so that it can also
take kernel attributes (essentially the grid dimension) as a parameter
and pass them on to the HSA libgomp plugin.  Because we do want HSAIL
generation to fail gracefully and use host fallback in that case, the
same function calls the host implementation if it cannot map the
requested function to an accelerated one or if a new callback,
can_run_func, indicates there is a problem.

We need a new hook because we use it to check for linking errors,
which we cannot do when incrementally loading registered images.  And
we do want to handle linking errors, so that when we cannot emit HSAIL
for a function called from a kernel (possibly in a different
compilation unit), we also resort to host fallback.
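
Condensed from the target.c hunk below, the decision in GOMP_target_41
whether to fall back to host execution therefore boils down to roughly
this:

  void *fn_addr = NULL;
  bool host_fallback = false;
  /* No usable device, no mapping for the function (which is tolerated
     for shared-memory devices), or a veto from the new can_run_func
     hook all lead to host fallback.  */
  if (devicep == NULL
      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
      || !(fn_addr = gomp_get_target_fn_addr (devicep, fn))
      || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
    host_fallback = true;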

Last but not least, the patch removes data remapping when the selected
device is capable of sharing memory with the host.

Thanks,

Martin


2015-11-02  Martin Jambor  <mjambor@suse.cz>
	    Martin Liska  <mliska@suse.cz>

include/
	* gomp-constants.h (GOMP_DEVICE_HSA): New macro.

libgomp/
	* libgomp-plugin.h (offload_target_type): New element
	OFFLOAD_TARGET_TYPE_HSA.
	* libgomp.h (gomp_device_descr): Extra parameter of run_func, new
	field can_run_func.
	* libgomp_g.h (GOMP_target_41): Add an extra parameter.
	* oacc-host.c (host_run): Add an extra unused parameter.
	* target.c (gomp_get_target_fn_addr): Allow failure if device shares
	memory.
	(GOMP_target): Assert failure did not happen.  Add extra parameter to
	call of run_func.
	(GOMP_target_41): Add an extra parameter, pass it to run_func.
	Allow host fallback if device shares memory.  Do not remap data if
	device has shared memory.
	(GOMP_target_data_41): Run host fallback if device has shared memory.
	(gomp_load_plugin_for_device): Also attempt to load can_run_func.


diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index f834dec..46d52b3 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -160,6 +160,7 @@ enum gomp_map_kind
 #define GOMP_DEVICE_NOT_HOST		4
 #define GOMP_DEVICE_NVIDIA_PTX		5
 #define GOMP_DEVICE_INTEL_MIC		6
+#define GOMP_DEVICE_HSA			7
 
 #define GOMP_DEVICE_ICV			-1
 #define GOMP_DEVICE_HOST_FALLBACK	-2
@@ -212,4 +213,35 @@ enum gomp_map_kind
 #define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
 #define GOMP_LAUNCH_OP_MAX 0xffff
 
+/* HSA specific data structures.  */
+
+/* An HSA kernel dispatch is a collection of information needed for
+   a kernel dispatch.  */
+
+struct hsa_kernel_dispatch
+{
+  /* Pointer to a command queue associated with a kernel dispatch agent.  */
+  void *queue;
+  /* Pointer to reserved memory for OMP data struct copying.  */
+  void *omp_data_memory;
+  /* Pointer to a memory space used for kernel arguments passing.  */
+  void *kernarg_address;
+  /* Kernel object.  */
+  uint64_t object;
+  /* Synchronization signal used for dispatch synchronization.  */
+  uint64_t signal;
+  /* Private segment size.  */
+  uint32_t private_segment_size;
+  /* Group segment size.  */
+  uint32_t group_segment_size;
+  /* Number of children kernel dispatches.  */
+  uint64_t kernel_dispatch_count;
+  /* Number of threads.  */
+  uint32_t omp_num_threads;
+  /* Debug purpose argument.  */
+  uint64_t debug;
+  /* Kernel dispatch structures created for children kernel dispatches.  */
+  struct hsa_kernel_dispatch **children_dispatches;
+};
+
 #endif
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 24fbb94..acf6eb7 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -48,7 +48,8 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_HOST = 2,
   /* OFFLOAD_TARGET_TYPE_HOST_NONSHM = 3 removed.  */
   OFFLOAD_TARGET_TYPE_NVIDIA_PTX = 5,
-  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
+  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6,
+  OFFLOAD_TARGET_TYPE_HSA = 7
 };
 
 /* Auxiliary struct, used for transferring pairs of addresses from plugin
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 9c8b1fb..0ad42d2 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -876,7 +876,8 @@ struct gomp_device_descr
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
-  void (*run_func) (int, void *, void *);
+  void (*run_func) (int, void *, void *, const void *);
+  bool (*can_run_func) (void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index c28ad21..adb9bcc 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -250,7 +250,8 @@ extern void GOMP_single_copy_end (void *);
 extern void GOMP_target (int, void (*) (void *), const void *,
 			 size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_41 (int, void (*) (void *), size_t, void **, size_t *,
-			  unsigned short *, unsigned int, void **);
+			    unsigned short *, unsigned int, void **,
+			    const void *);
 extern void GOMP_target_data (int, const void *,
 			      size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_data_41 (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 8e4ba04..c0c4d52 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -123,7 +123,8 @@ host_host2dev (int n __attribute__ ((unused)),
 }
 
 static void
-host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars)
+host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
+	  const void* kern_launch __attribute__ ((unused)))
 {
   void (*fn)(void *) = (void (*)(void *)) fn_ptr;
 
diff --git a/libgomp/target.c b/libgomp/target.c
index b767410..404faa4 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1248,7 +1248,12 @@ gomp_get_target_fn_addr (struct gomp_device_descr *devicep,
       splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       gomp_mutex_unlock (&devicep->lock);
       if (tgt_fn == NULL)
-	gomp_fatal ("Target function wasn't mapped");
+	{
+	  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+	    return NULL;
+	  else
+	    gomp_fatal ("Target function wasn't mapped");
+	}
 
       return (void *) tgt_fn->tgt_offset;
     }
@@ -1276,6 +1281,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     return gomp_target_fallback (fn, hostaddrs);
 
   void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
+  assert (fn_addr);
 
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
@@ -1288,7 +1294,8 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
       thr->place = old_thr.place;
       thr->ts.place_partition_len = gomp_places_list_len;
     }
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
+  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start,
+		     NULL);
   gomp_free_thread (thr);
   *thr = old_thr;
   gomp_unmap_vars (tgt_vars, true);
@@ -1297,7 +1304,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
 void
 GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
 		void **hostaddrs, size_t *sizes, unsigned short *kinds,
-		unsigned int flags, void **depend)
+		unsigned int flags, void **depend, const void *kernel_launch)
 {
   struct gomp_device_descr *devicep = resolve_device (device);
 
@@ -1312,8 +1319,16 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
 	gomp_task_maybe_wait_for_dependencies (depend);
     }
 
+  void *fn_addr = NULL;
+  bool host_fallback = false;
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || !(fn_addr = gomp_get_target_fn_addr (devicep, fn))
+      || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
+    host_fallback = true;
+
+  if (host_fallback
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     {
       size_t i, tgt_align = 0, tgt_size = 0;
       char *tgt = NULL;
@@ -1343,15 +1358,20 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
 		tgt_size = tgt_size + sizes[i];
 	      }
 	}
-      gomp_target_fallback (fn, hostaddrs);
-      return;
-    }
 
-  void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
+      if (host_fallback)
+	{
+	  gomp_target_fallback (fn, hostaddrs);
+	  return;
+	}
+    }
 
-  struct target_mem_desc *tgt_vars
-    = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		     GOMP_MAP_VARS_TARGET);
+  struct target_mem_desc *tgt_vars;
+  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    tgt_vars = NULL;
+  else
+    tgt_vars = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds,
+			      true, GOMP_MAP_VARS_TARGET);
   struct gomp_thread old_thr, *thr = gomp_thread ();
   old_thr = *thr;
   memset (thr, '\0', sizeof (*thr));
@@ -1360,10 +1380,13 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
       thr->place = old_thr.place;
       thr->ts.place_partition_len = gomp_places_list_len;
     }
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
+  devicep->run_func (devicep->target_id, fn_addr,
+		     tgt_vars ? (void *) tgt_vars->tgt_start : hostaddrs,
+		     kernel_launch);
   gomp_free_thread (thr);
   *thr = old_thr;
-  gomp_unmap_vars (tgt_vars, true);
+  if (tgt_vars)
+    gomp_unmap_vars (tgt_vars, true);
 }
 
 /* Host fallback for GOMP_target_data{,_41} routines.  */
@@ -1393,6 +1416,7 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
   struct gomp_device_descr *devicep = resolve_device (device);
 
   if (devicep == NULL
+      || (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
     return gomp_target_data_fallback ();
 
@@ -2112,6 +2136,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {
       DLSYM (run);
+      DLSYM_OPT (can_run, can_run);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)


* [hsa 3/12] HSA libgomp plugin
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
  2015-11-05 21:53 ` [hsa 1/12] Configuration and offloading-related changes Martin Jambor
  2015-11-05 21:54 ` [hsa 2/12] Modifications to libgomp proper Martin Jambor
@ 2015-11-05 21:56 ` Martin Jambor
  2015-11-05 22:47   ` Joseph Myers
  2015-11-05 21:57 ` [hsa 4/12] OpenMP lowering/expansion changes (gridification) Martin Jambor
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:56 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek

Hi,

the patch below adds the HSA-specific plugin for libgomp.  The plugin
implements the interface mandated by libgomp and takes care of finding
any available HSA devices, finalizing HSAIL code and running it on
HSA-capable GPUs.  The plugin does not really implement any data
movement functions (it implements them with a fatal error call)
because memory is shared in HSA environments and the previous patch
has modified libgomp proper not to call those functions on devices
with this capability.
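
As an illustration, the plugin can be observed in action by running an
OpenMP program (the binary name below is made up) with the HSA_DEBUG
environment variable set:

  $ HSA_DEBUG=1 ./a.out

This makes the plugin report device initialization, program
finalization and kernel dispatches to stderr.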

When going over the code for the last time, I realized I did not
implement any version checks in the plugin yet, but did not want to
hold off the initial review round because of that.  I will have a look
at it as soon as possible.

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>
	    Martin Liska  <mliska@suse.cz>

	* plugin/plugin-hsa.c: New file.



diff --git a/libgomp/plugin/plugin-hsa.c b/libgomp/plugin/plugin-hsa.c
new file mode 100644
index 0000000..c1b7879
--- /dev/null
+++ b/libgomp/plugin/plugin-hsa.c
@@ -0,0 +1,1309 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <pthread.h>
+#include "libgomp-plugin.h"
+#include "gomp-constants.h"
+#include "hsa.h"
+#include "hsa_ext_finalize.h"
+#include "dlfcn.h"
+
+/* Part of the libgomp plugin interface.  Return the name of the accelerator,
+   which is "hsa".  */
+
+const char *
+GOMP_OFFLOAD_get_name (void)
+{
+  return "hsa";
+}
+
+/* Part of the libgomp plugin interface.  Return the specific capabilities the
+   HSA accelerator has.  */
+
+unsigned int
+GOMP_OFFLOAD_get_caps (void)
+{
+  return GOMP_OFFLOAD_CAP_SHARED_MEM | GOMP_OFFLOAD_CAP_OPENMP_400;
+}
+
+/* Part of the libgomp plugin interface.  Identify as HSA accelerator.  */
+
+int
+GOMP_OFFLOAD_get_type (void)
+{
+  return OFFLOAD_TARGET_TYPE_HSA;
+}
+
+/* Return the libgomp version number we're compatible with.  There is
+   no requirement for cross-version compatibility.  */
+
+unsigned
+GOMP_OFFLOAD_version (void)
+{
+  return GOMP_VERSION;
+}
+
+/* Flag to decide whether to print information about what is going on to
+   stderr.  Set in init_environment_variables depending on the environment.  */
+
+static bool debug;
+
+/* Flag to decide if the runtime should suppress a possible fallback to host
+   execution.  */
+
+static bool suppress_host_fallback;
+
+/* Initialize debug and suppress_host_fallback according to the environment.  */
+
+static void
+init_environment_variables (void)
+{
+  if (getenv ("HSA_DEBUG"))
+    debug = true;
+  else
+    debug = false;
+
+  if (getenv ("HSA_SUPPRESS_HOST_FALLBACK"))
+    suppress_host_fallback = true;
+  else
+    suppress_host_fallback = false;
+}
+
+/* Print a logging message with PREFIX to stderr if the debug flag is set (see
+   the HSA_DEBUG environment variable).  */
+
+#define HSA_LOG(prefix, ...) \
+  do \
+  { \
+    if (debug) \
+      { \
+	fprintf (stderr, prefix); \
+	fprintf (stderr, __VA_ARGS__); \
+      } \
+  } \
+  while (false)
+
+/* Print a debugging message to stderr.  */
+
+#define HSA_DEBUG(...) HSA_LOG ("HSA debug: ", __VA_ARGS__)
+
+/* Print a warning message to stderr.  */
+
+#define HSA_WARNING(...) HSA_LOG ("HSA warning: ", __VA_ARGS__)
+
+/* Print HSA warning STR with an HSA STATUS code.  */
+
+static void
+hsa_warn (const char *str, hsa_status_t status)
+{
+  if (!debug)
+    return;
+
+  const char* hsa_error;
+  hsa_status_string (status, &hsa_error);
+
+  unsigned l = strlen (hsa_error);
+
+  char *err = GOMP_PLUGIN_malloc (l + 1);
+  memcpy (err, hsa_error, l);
+  err[l] = '\0';
+
+  fprintf (stderr, "HSA warning: %s (%s)\n", str, err);
+
+  free (err);
+}
+
+/* Report a fatal error STR together with the HSA error corresponding to STATUS
+   and terminate execution of the current process.  */
+
+static void
+hsa_fatal (const char *str, hsa_status_t status)
+{
+  const char* hsa_error;
+  hsa_status_string (status, &hsa_error);
+  GOMP_PLUGIN_fatal ("HSA fatal error: %s (%s)", str, hsa_error);
+}
+
+struct hsa_kernel_description
+{
+  const char *name;
+  unsigned omp_data_size;
+  unsigned kernel_dependencies_count;
+  const char **kernel_dependencies;
+};
+
+/* Data passed by the static initializer of a compilation unit containing BRIG
+   to GOMP_offload_register.  */
+
+struct brig_image_desc
+{
+  hsa_ext_module_t brig_module;
+  const unsigned kernel_count;
+  struct hsa_kernel_description *kernel_infos;
+};
+
+struct agent_info;
+
+/* Information required to identify, finalize and run any given kernel.  */
+
+struct kernel_info
+{
+  /* Name of the kernel, required to locate it within the brig module.  */
+  const char *name;
+  /* Size of memory space for OMP data.  */
+  unsigned omp_data_size;
+  /* The specific agent the kernel has been or will be finalized for and run
+     on.  */
+  struct agent_info *agent;
+  /* The specific module in which the kernel is defined.  */
+  struct module_info *module;
+  /* Mutex enforcing that at most one thread ever initializes a kernel for
+     use.  A thread should have locked agent->modules_rwlock for reading before
+     acquiring it.  */
+  pthread_mutex_t init_mutex;
+  /* Flag indicating whether the kernel has been initialized and all fields
+     below it contain valid data.  */
+  bool initialized;
+  /* Flag indicating that the kernel has a problem that blocks its execution.  */
+  bool initialization_failed;
+  /* The object to be put into the dispatch queue.  */
+  uint64_t object;
+  /* Required size of kernel arguments.  */
+  uint32_t kernarg_segment_size;
+  /* Required size of group segment.  */
+  uint32_t group_segment_size;
+  /* Required size of private segment.  */
+  uint32_t private_segment_size;
+  /* List of all kernel dependencies.  */
+  const char **dependencies;
+  /* Number of dependencies.  */
+  unsigned dependencies_count;
+  /* Maximum OMP data size necessary for kernel-from-kernel dispatches.  */
+  unsigned max_omp_data_size;
+};
+
+/* Information about a particular brig module, its image and kernels.  */
+
+struct module_info
+{
+  /* The next and previous module in the linked list of modules of an agent.  */
+  struct module_info *next, *prev;
+  /* The description with which the program has registered the image.  */
+  struct brig_image_desc *image_desc;
+
+  /* Number of kernels in this module.  */
+  int kernel_count;
+  /* An array of kernel_info structures describing each kernel in this
+     module.  */
+  struct kernel_info kernels[];
+};
+
+/* Information about shared brig library.  */
+
+struct brig_library_info
+{
+  char *file_name;
+  hsa_ext_module_t image;
+};
+
+/* Description of an HSA GPU agent and the program associated with it.  */
+
+struct agent_info
+{
+  /* The HSA ID of the agent.  Assigned when hsa_context is initialized.  */
+  hsa_agent_t id;
+  /* Whether the agent has been initialized.  The fields below are usable only
+     if it has been.  */
+  bool initialized;
+  /* The HSA ISA of this agent.  */
+  hsa_isa_t isa;
+  /* Command queue of the agent.  */
+  hsa_queue_t* command_q;
+  /* Command queue for kernel-from-kernel dispatches.  */
+  hsa_queue_t* kernel_dispatch_command_q;
+  /* The HSA memory region from which to allocate kernel arguments.  */
+  hsa_region_t kernarg_region;
+
+  /* Read-write lock that protects kernels which are running or about to be run
+     from interference with loading and unloading of images.  Needs to be
+     locked for reading while a kernel is being run, and for writing if the
+     list of modules is manipulated (and thus the HSA program invalidated).  */
+  pthread_rwlock_t modules_rwlock;
+  /* The first module in a linked list of modules associated with this
+     agent.  */
+  struct module_info *first_module;
+
+  /* Mutex enforcing that only one thread will finalize the HSA program.  A
+     thread should have locked agent->modules_rwlock for reading before
+     acquiring it.  */
+  pthread_mutex_t prog_mutex;
+  /* Flag whether the HSA program that consists of all the modules has been
+     finalized.  */
+  bool prog_finalized;
+  /* Flag whether the program was finalized but with a failure.  */
+  bool prog_finalized_error;
+  /* HSA executable - the finalized program that is used to locate kernels.  */
+  hsa_executable_t executable;
+  /* List of BRIG libraries.  */
+  struct brig_library_info **brig_libraries;
+  /* Number of loaded shared BRIG libraries.  */
+  unsigned brig_libraries_count;
+};
+
+/* Information about the whole HSA environment and all of its agents.  */
+
+struct hsa_context_info
+{
+  /* Whether the structure has been initialized.  */
+  bool initialized;
+  /* Number of usable GPU HSA agents in the system.  */
+  int agent_count;
+  /* Array of agent_info structures describing the individual HSA agents.  */
+  struct agent_info *agents;
+};
+
+/* Information about the whole HSA environment and all of its agents.  */
+
+static struct hsa_context_info hsa_context;
+
+/* Find kernel for an AGENT by name provided in KERNEL_NAME.  */
+
+static struct kernel_info *
+get_kernel_for_agent (struct agent_info *agent, const char *kernel_name)
+{
+  struct module_info *module = agent->first_module;
+
+  while (module)
+    {
+      for (unsigned i = 0; i < module->kernel_count; i++)
+	if (strcmp (module->kernels[i].name, kernel_name) == 0)
+	  return &module->kernels[i];
+
+      module = module->next;
+    }
+
+  return NULL;
+}
+
+/* Return true if the agent is a GPU and capable of concurrent submissions
+   from different threads.  */
+
+static bool
+suitable_hsa_agent_p (hsa_agent_t agent)
+{
+  hsa_device_type_t device_type;
+  hsa_status_t status = hsa_agent_get_info (agent, HSA_AGENT_INFO_DEVICE,
+					  &device_type);
+  if (status != HSA_STATUS_SUCCESS || device_type != HSA_DEVICE_TYPE_GPU)
+    return false;
+
+  uint32_t features = 0;
+  status = hsa_agent_get_info (agent, HSA_AGENT_INFO_FEATURE, &features);
+  if (status != HSA_STATUS_SUCCESS
+      || !(features & HSA_AGENT_FEATURE_KERNEL_DISPATCH))
+    return false;
+  hsa_queue_type_t queue_type;
+  status = hsa_agent_get_info (agent, HSA_AGENT_INFO_QUEUE_TYPE, &queue_type);
+  if (status != HSA_STATUS_SUCCESS
+      || (queue_type != HSA_QUEUE_TYPE_MULTI))
+    return false;
+
+  return true;
+}
+
+/* Callback of hsa_iterate_agents, if AGENT is a GPU device, increment
+   agent_count in hsa_context.  */
+
+static hsa_status_t
+count_gpu_agents (hsa_agent_t agent, void *data __attribute__ ((unused)))
+{
+  if (suitable_hsa_agent_p (agent))
+    hsa_context.agent_count++;
+  return HSA_STATUS_SUCCESS;
+}
+
+/* Callback of hsa_iterate_agents, if AGENT is a GPU device, assign the agent
+   id to the describing structure in the hsa context.  The index of the
+   structure is pointed to by DATA, increment it afterwards.  */
+
+static hsa_status_t
+assign_agent_ids (hsa_agent_t agent, void *data)
+{
+  if (suitable_hsa_agent_p (agent))
+    {
+      int *agent_index = (int *) data;
+      hsa_context.agents[*agent_index].id = agent;
+      ++*agent_index;
+    }
+  return HSA_STATUS_SUCCESS;
+}
+
+/* Initialize hsa_context if it has not already been done.  */
+
+static void
+init_hsa_context (void)
+{
+  hsa_status_t status;
+  int agent_index = 0;
+
+  if (hsa_context.initialized)
+    return;
+  init_environment_variables ();
+  status = hsa_init ();
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Run-time could not be initialized", status);
+  HSA_DEBUG ("HSA run-time initialized\n");
+  status = hsa_iterate_agents (count_gpu_agents, NULL);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("HSA GPU devices could not be enumerated", status);
+  HSA_DEBUG ("There are %i HSA GPU devices.\n", hsa_context.agent_count);
+
+  hsa_context.agents
+    = GOMP_PLUGIN_malloc_cleared (hsa_context.agent_count
+				  * sizeof (struct agent_info));
+  status = hsa_iterate_agents (assign_agent_ids, &agent_index);
+  if (agent_index != hsa_context.agent_count)
+    GOMP_PLUGIN_fatal ("Failed to assign IDs to all HSA agents");
+  hsa_context.initialized = true;
+}
+
+/* Callback of dispatch queues to report errors.  */
+
+static void
+queue_callback (hsa_status_t status,
+		hsa_queue_t *queue __attribute__ ((unused)),
+	       void* data __attribute__ ((unused)))
+{
+  hsa_fatal ("Asynchronous queue error", status);
+}
+
+/* Callback of hsa_agent_iterate_regions.  Determine if a memory REGION can be
+   used for kernarg allocations and if so write it to the memory pointed to by
+   DATA and break the query.  */
+
+static hsa_status_t
+get_kernarg_memory_region (hsa_region_t region, void *data)
+{
+  hsa_status_t status;
+  hsa_region_segment_t segment;
+
+  status = hsa_region_get_info (region, HSA_REGION_INFO_SEGMENT, &segment);
+  if (status != HSA_STATUS_SUCCESS)
+    return status;
+  if (segment != HSA_REGION_SEGMENT_GLOBAL)
+    return HSA_STATUS_SUCCESS;
+
+  uint32_t flags;
+  status = hsa_region_get_info (region, HSA_REGION_INFO_GLOBAL_FLAGS, &flags);
+  if (status != HSA_STATUS_SUCCESS)
+    return status;
+  if (flags & HSA_REGION_GLOBAL_FLAG_KERNARG)
+    {
+      hsa_region_t* ret = (hsa_region_t*) data;
+      *ret = region;
+      return HSA_STATUS_INFO_BREAK;
+    }
+  return HSA_STATUS_SUCCESS;
+}
+
+/* Part of the libgomp plugin interface.  Return the number of HSA devices on
+   the system.  */
+
+int
+GOMP_OFFLOAD_get_num_devices (void)
+{
+  init_hsa_context ();
+  return hsa_context.agent_count;
+}
+
+/* Part of the libgomp plugin interface.  Initialize agent number N so that it
+   can be used for computation.  */
+
+void
+GOMP_OFFLOAD_init_device (int n)
+{
+  init_hsa_context ();
+  if (n >= hsa_context.agent_count)
+    GOMP_PLUGIN_fatal ("Request to initialize non-existing HSA device %i", n);
+  struct agent_info *agent = &hsa_context.agents[n];
+
+  if (agent->initialized)
+    return;
+
+  if (pthread_rwlock_init (&agent->modules_rwlock, NULL))
+    GOMP_PLUGIN_fatal ("Failed to initialize an HSA agent rwlock");
+  if (pthread_mutex_init (&agent->prog_mutex, NULL))
+    GOMP_PLUGIN_fatal ("Failed to initialize an HSA agent program mutex");
+
+  uint32_t queue_size;
+  hsa_status_t status;
+  status = hsa_agent_get_info (agent->id, HSA_AGENT_INFO_QUEUE_MAX_SIZE,
+			       &queue_size);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error requesting maximum queue size of the HSA agent", status);
+  status = hsa_agent_get_info (agent->id, HSA_AGENT_INFO_ISA, &agent->isa);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error querying the ISA of the agent", status);
+  status = hsa_queue_create (agent->id, queue_size, HSA_QUEUE_TYPE_MULTI,
+			     queue_callback, NULL, UINT32_MAX, UINT32_MAX,
+			     &agent->command_q);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error creating command queue", status);
+
+  status = hsa_queue_create (agent->id, queue_size, HSA_QUEUE_TYPE_MULTI,
+			     queue_callback, NULL, UINT32_MAX, UINT32_MAX,
+			     &agent->kernel_dispatch_command_q);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error creating kernel dispatch command queue", status);
+
+  agent->kernarg_region.handle = (uint64_t) -1;
+  status = hsa_agent_iterate_regions (agent->id, get_kernarg_memory_region,
+				      &agent->kernarg_region);
+  if (agent->kernarg_region.handle == (uint64_t) -1)
+    GOMP_PLUGIN_fatal ("Could not find suitable memory region for kernel "
+		       "arguments");
+  HSA_DEBUG ("HSA agent initialized, queue has id %llu\n",
+	     (long long unsigned) agent->command_q->id);
+  HSA_DEBUG ("HSA agent initialized, kernel dispatch queue has id %llu\n",
+	     (long long unsigned) agent->kernel_dispatch_command_q->id);
+  agent->initialized = true;
+}
+
+/* Verify that hsa_context has already been initialized and return the
+   agent_info structure describing device number N.  */
+
+static struct agent_info *
+get_agent_info (int n)
+{
+  if (!hsa_context.initialized)
+    GOMP_PLUGIN_fatal ("Attempt to use uninitialized HSA context.");
+  if (n >= hsa_context.agent_count)
+    GOMP_PLUGIN_fatal ("Request to operate on anon-existing HSA device %i", n);
+  if (!hsa_context.agents[n].initialized)
+    GOMP_PLUGIN_fatal ("Attempt to use an uninitialized HSA agent.");
+  return &hsa_context.agents[n];
+}
+
+/* Insert MODULE to the linked list of modules of AGENT.  */
+
+static void
+add_module_to_agent (struct agent_info *agent, struct module_info *module)
+{
+  if (agent->first_module)
+      agent->first_module->prev = module;
+  module->next = agent->first_module;
+  module->prev = NULL;
+  agent->first_module = module;
+}
+
+/* Remove MODULE from the linked list of modules of AGENT.  */
+
+static void
+remove_module_from_agent (struct agent_info *agent, struct module_info *module)
+{
+  if (agent->first_module == module)
+    agent->first_module = module->next;
+  if (module->prev)
+    module->prev->next = module->next;
+  if (module->next)
+    module->next->prev = module->prev;
+}
+
+/* Free the HSA program in agent and everything associated with it and set
+   agent->prog_finalized and the initialized flags of all kernels to false.  */
+
+static void
+destroy_hsa_program (struct agent_info *agent)
+{
+  if (!agent->prog_finalized || agent->prog_finalized_error)
+    return;
+
+  hsa_status_t status;
+
+  HSA_DEBUG ("Destroying the current HSA program.\n");
+
+  status = hsa_executable_destroy (agent->executable);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not destroy HSA executable", status);
+
+  struct module_info *module;
+  for (module = agent->first_module; module; module = module->next)
+    {
+      int i;
+      for (i = 0; i < module->kernel_count; i++)
+	module->kernels[i].initialized = false;
+    }
+  agent->prog_finalized = false;
+}
+
+/* Part of the libgomp plugin interface.  Load BRIG module described by struct
+   brig_image_desc in TARGET_DATA and return references to kernel descriptors
+   in TARGET_TABLE.  */
+
+/* FIXME: Start using some kind of versioning scheme too, I suppose.  */
+
+int
+GOMP_OFFLOAD_load_image (int ord, unsigned version  __attribute__ ((unused)),
+			 void *target_data, struct addr_pair **target_table)
+{
+  struct brig_image_desc *image_desc = (struct brig_image_desc *) target_data;
+  struct agent_info *agent;
+  struct addr_pair *pair;
+  struct module_info *module;
+  struct kernel_info *kernel;
+  int kernel_count = image_desc->kernel_count;
+
+  agent = get_agent_info (ord);
+  if (pthread_rwlock_wrlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to write-lock an HSA agent rwlock");
+  if (agent->prog_finalized)
+    destroy_hsa_program (agent);
+
+  HSA_DEBUG ("Encountered %d kernels in an image\n", kernel_count);
+  pair = GOMP_PLUGIN_malloc (kernel_count * sizeof (struct addr_pair));
+  *target_table = pair;
+  module = (struct module_info *)
+    GOMP_PLUGIN_malloc_cleared (sizeof (struct module_info)
+				+ kernel_count * sizeof (struct kernel_info));
+  module->image_desc = image_desc;
+  module->kernel_count = kernel_count;
+
+  kernel = &module->kernels[0];
+
+  /* Allocate memory for kernel dependencies.  */
+  for (unsigned i = 0; i < kernel_count; i++)
+    {
+      pair->start = (uintptr_t) kernel;
+      pair->end = (uintptr_t) (kernel + 1);
+
+      struct hsa_kernel_description *d = &image_desc->kernel_infos[i];
+      kernel->agent = agent;
+      kernel->module = module;
+      kernel->name = d->name;
+      kernel->omp_data_size = d->omp_data_size;
+      kernel->dependencies_count = d->kernel_dependencies_count;
+      kernel->dependencies = d->kernel_dependencies;
+      if (pthread_mutex_init (&kernel->init_mutex, NULL))
+	GOMP_PLUGIN_fatal ("Failed to initialize an HSA kernel mutex");
+
+      kernel++;
+      pair++;
+    }
+
+  add_module_to_agent (agent, module);
+  if (pthread_rwlock_unlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to unlock an HSA agent rwlock");
+  return kernel_count;
+}
+
+/* Add a shared BRIG library from a FILE_NAME to an AGENT.  */
+
+static struct brig_library_info *
+add_shared_library (const char *file_name, struct agent_info *agent)
+{
+  struct brig_library_info *library = NULL;
+
+  void *f = dlopen (file_name, RTLD_NOW);
+  if (f == NULL)
+    return NULL;
+  void *start = dlsym (f, "__brig_start");
+  void *end = dlsym (f, "__brig_end");
+
+  if (start == NULL || end == NULL)
+    return NULL;
+
+  unsigned size = end - start;
+  char *buf = (char *) GOMP_PLUGIN_malloc (size);
+  memcpy (buf, start, size);
+
+  library = GOMP_PLUGIN_malloc (sizeof (struct brig_library_info));
+  library->file_name = (char *) GOMP_PLUGIN_malloc
+    ((strlen (file_name) + 1) * sizeof (char));
+  strcpy (library->file_name, file_name);
+  library->image = (hsa_ext_module_t) buf;
+
+  return library;
+}
+
+/* Release memory used for BRIG shared libraries that correspond
+   to an AGENT.  */
+
+static void
+release_agent_shared_libraries (struct agent_info *agent)
+{
+  for (unsigned i = 0; i < agent->brig_libraries_count; i++)
+    if (agent->brig_libraries[i])
+      {
+	free (agent->brig_libraries[i]->file_name);
+	free (agent->brig_libraries[i]->image);
+	free (agent->brig_libraries[i]);
+      }
+
+  free (agent->brig_libraries);
+}
+
+/* Create and finalize the program consisting of all loaded modules.  */
+
+static void
+create_and_finalize_hsa_program (struct agent_info *agent)
+{
+  hsa_status_t status;
+  hsa_ext_program_t prog_handle;
+  int mi = 0;
+
+  if (pthread_mutex_lock (&agent->prog_mutex))
+    GOMP_PLUGIN_fatal ("Could not lock an HSA agent program mutex");
+  if (agent->prog_finalized)
+    goto final;
+
+  status = hsa_ext_program_create (HSA_MACHINE_MODEL_LARGE, HSA_PROFILE_FULL,
+				   HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT,
+				   NULL, &prog_handle);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not create an HSA program", status);
+
+  HSA_DEBUG ("Created a finalized program\n");
+
+  struct module_info *module = agent->first_module;
+  while (module)
+    {
+      status = hsa_ext_program_add_module (prog_handle,
+					   module->image_desc->brig_module);
+      if (status != HSA_STATUS_SUCCESS)
+	hsa_fatal ("Could not add a module to the HSA program", status);
+      module = module->next;
+      mi++;
+    }
+
+  /* Load all shared libraries.  */
+  const char *libraries[] = { "libhsamath.so", "libhsastd.so" };
+  const unsigned libraries_count = sizeof (libraries) / sizeof (const char *);
+
+  agent->brig_libraries_count = libraries_count;
+  agent->brig_libraries = GOMP_PLUGIN_malloc_cleared
+    (sizeof (struct brig_library_info) * libraries_count);
+
+  for (unsigned i = 0; i < libraries_count; i++)
+    {
+      struct brig_library_info *library = add_shared_library (libraries[i],
+							      agent);
+      if (library == NULL)
+	{
+	  HSA_WARNING ("Could not open a shared BRIG library: %s\n",
+		       libraries[i]);
+	  continue;
+	}
+
+      status = hsa_ext_program_add_module (prog_handle, library->image);
+      if (status != HSA_STATUS_SUCCESS)
+	hsa_warn ("Could not add a shared BRIG library to the HSA program",
+		  status);
+      else
+	HSA_DEBUG ("a shared BRIG library has been added to a program: %s\n",
+		   libraries[i]);
+    }
+
+  hsa_ext_control_directives_t control_directives;
+  memset (&control_directives, 0, sizeof (control_directives));
+  hsa_code_object_t code_object;
+  status = hsa_ext_program_finalize (prog_handle, agent->isa,
+				    HSA_EXT_FINALIZER_CALL_CONVENTION_AUTO,
+				    control_directives, "",
+				    HSA_CODE_OBJECT_TYPE_PROGRAM,
+				    &code_object);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      hsa_warn ("Finalization of the HSA program failed", status);
+      goto failure;
+    }
+
+  HSA_DEBUG ("Finalization done\n");
+  hsa_ext_program_destroy (prog_handle);
+
+  status = hsa_executable_create (HSA_PROFILE_FULL,
+				  HSA_EXECUTABLE_STATE_UNFROZEN,
+				  "", &agent->executable);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not create HSA executable", status);
+
+  status = hsa_executable_load_code_object (agent->executable, agent->id,
+					     code_object, "");
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not add a code object to the HSA executable", status);
+  status = hsa_executable_freeze (agent->executable, "");
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not freeze the HSA executable", status);
+
+  HSA_DEBUG ("Froze HSA executable with the finalized code object\n");
+
+  /* If everything went well, jump to final.  */
+  goto final;
+
+failure:
+  release_agent_shared_libraries (agent);
+  agent->prog_finalized_error = true;
+
+final:
+  agent->prog_finalized = true;
+
+  if (pthread_mutex_unlock (&agent->prog_mutex))
+    GOMP_PLUGIN_fatal ("Could not unlock an HSA agent program mutex");
+}
+
+/* Create kernel dispatch data structure for given KERNEL.  */
+
+static struct hsa_kernel_dispatch *
+create_single_kernel_dispatch (struct kernel_info *kernel,
+			       unsigned omp_data_size)
+{
+  struct agent_info *agent = kernel->agent;
+  struct hsa_kernel_dispatch *shadow = GOMP_PLUGIN_malloc_cleared
+    (sizeof (struct hsa_kernel_dispatch));
+
+  shadow->queue = agent->command_q;
+  shadow->omp_data_memory = omp_data_size > 0
+    ? GOMP_PLUGIN_malloc (omp_data_size) : NULL;
+  unsigned dispatch_count = kernel->dependencies_count;
+  shadow->kernel_dispatch_count = dispatch_count;
+
+  shadow->children_dispatches = GOMP_PLUGIN_malloc
+    (dispatch_count * sizeof (struct hsa_kernel_dispatch *));
+
+  shadow->object = kernel->object;
+
+  hsa_signal_t sync_signal;
+  hsa_status_t status = hsa_signal_create (1, 0, NULL, &sync_signal);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error creating the HSA sync signal", status);
+
+  shadow->signal = sync_signal.handle;
+  shadow->private_segment_size = kernel->private_segment_size;
+  shadow->group_segment_size = kernel->group_segment_size;
+
+  status = hsa_memory_allocate
+    (agent->kernarg_region, kernel->kernarg_segment_size,
+     &shadow->kernarg_address);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not allocate memory for HSA kernel arguments", status);
+
+  return shadow;
+}
+
+/* Release data structure created for a kernel dispatch in SHADOW argument.  */
+
+static void
+release_kernel_dispatch (struct hsa_kernel_dispatch *shadow)
+{
+  HSA_DEBUG ("Released kernel dispatch: %p has value: %lu (%p)\n",
+	     shadow, shadow->debug, (void *)shadow->debug);
+
+  hsa_memory_free (shadow->kernarg_address);
+
+  hsa_signal_t s;
+  s.handle = shadow->signal;
+  hsa_signal_destroy (s);
+
+  free (shadow->omp_data_memory);
+
+  for (unsigned i = 0; i < shadow->kernel_dispatch_count; i++)
+    release_kernel_dispatch (shadow->children_dispatches[i]);
+
+  free (shadow->children_dispatches);
+  free (shadow);
+}
+
+/* Initialize a KERNEL without its dependencies.  MAX_OMP_DATA_SIZE is used
+   to calculate maximum necessary memory for OMP data allocation.  */
+
+static void
+init_single_kernel (struct kernel_info *kernel, unsigned *max_omp_data_size)
+{
+  hsa_status_t status;
+  struct agent_info *agent = kernel->agent;
+  hsa_executable_symbol_t kernel_symbol;
+  status = hsa_executable_get_symbol (agent->executable, NULL, kernel->name,
+				      agent->id, 0, &kernel_symbol);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      hsa_warn ("Could not find symbol for kernel in the code object", status);
+      goto failure;
+    }
+  HSA_DEBUG ("Located kernel %s\n", kernel->name);
+  status = hsa_executable_symbol_get_info
+    (kernel_symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel->object);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not extract a kernel object from its symbol", status);
+  status = hsa_executable_symbol_get_info
+    (kernel_symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE,
+     &kernel->kernarg_segment_size);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not get info about kernel argument size", status);
+  status = hsa_executable_symbol_get_info
+    (kernel_symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE,
+     &kernel->group_segment_size);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not get info about kernel group segment size", status);
+  status = hsa_executable_symbol_get_info
+    (kernel_symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE,
+     &kernel->private_segment_size);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not get info about kernel private segment size",
+	       status);
+
+  HSA_DEBUG ("Kernel structure for %s fully initialized with "
+	     "following segment sizes: \n", kernel->name);
+  HSA_DEBUG ("  group_segment_size: %u\n",
+	     (unsigned) kernel->group_segment_size);
+  HSA_DEBUG ("  private_segment_size: %u\n",
+	     (unsigned) kernel->private_segment_size);
+  HSA_DEBUG ("  kernarg_segment_size: %u\n",
+	     (unsigned) kernel->kernarg_segment_size);
+  HSA_DEBUG ("  omp_data_size: %u\n", kernel->omp_data_size);
+
+  if (kernel->omp_data_size > *max_omp_data_size)
+    *max_omp_data_size = kernel->omp_data_size;
+
+  for (unsigned i = 0; i < kernel->dependencies_count; i++)
+    {
+      struct kernel_info *dependency = get_kernel_for_agent
+	(agent, kernel->dependencies[i]);
+
+      if (dependency == NULL)
+	{
+	  HSA_DEBUG ("Could not find a dependency for a kernel: %s, "
+		     "dependency name: %s\n", kernel->name,
+		     kernel->dependencies[i]);
+	  goto failure;
+	}
+
+      if (dependency->dependencies_count > 0)
+	{
+	  HSA_DEBUG ("HSA does not allow kernel dispatching code with "
+		     "a depth bigger than one\n");
+	  goto failure;
+	}
+
+      init_single_kernel (dependency, max_omp_data_size);
+    }
+
+  return;
+
+failure:
+  kernel->initialization_failed = true;
+}
+
+/* Indent stream F by INDENT spaces.  */
+
+static void
+indent_stream (FILE *f, unsigned indent)
+{
+  for (unsigned i = 0; i < indent; i++)
+    fputc (' ', f);
+}
+
+/* Dump kernel DISPATCH data structure and indent it by INDENT spaces.  */
+
+static void
+print_kernel_dispatch (struct hsa_kernel_dispatch *dispatch, unsigned indent)
+{
+  indent_stream (stderr, indent);
+  fprintf (stderr, "this: %p\n", dispatch);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "queue: %p\n", dispatch->queue);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "omp_data_memory: %p\n", dispatch->omp_data_memory);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "kernarg_address: %p\n", dispatch->kernarg_address);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "object: %lu\n", dispatch->object);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "signal: %lu\n", dispatch->signal);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "private_segment_size: %u\n",
+	   dispatch->private_segment_size);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "group_segment_size: %u\n",
+	   dispatch->group_segment_size);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "children dispatches: %lu\n",
+	   dispatch->kernel_dispatch_count);
+  indent_stream (stderr, indent);
+  fprintf (stderr, "omp_num_threads: %u\n",
+	   dispatch->omp_num_threads);
+  fprintf (stderr, "\n");
+
+  for (unsigned i = 0; i < dispatch->kernel_dispatch_count; i++)
+      print_kernel_dispatch (dispatch->children_dispatches[i], indent + 2);
+}
+
+/* Create kernel dispatch data structure for a KERNEL and all its
+   dependencies.  */
+
+static struct hsa_kernel_dispatch *
+create_kernel_dispatch (struct kernel_info *kernel, unsigned omp_data_size)
+{
+  struct hsa_kernel_dispatch *shadow = create_single_kernel_dispatch
+    (kernel, omp_data_size);
+  shadow->omp_num_threads = 64;
+  shadow->debug = 0;
+
+  /* Create kernel dispatch data structures.  We do not allow a kernel
+     dispatch with a depth bigger than one.  */
+  for (unsigned i = 0; i < kernel->dependencies_count; i++)
+    {
+      struct kernel_info *dependency = get_kernel_for_agent
+	(kernel->agent, kernel->dependencies[i]);
+      shadow->children_dispatches[i] = create_single_kernel_dispatch
+	(dependency, omp_data_size);
+      shadow->children_dispatches[i]->queue =
+	kernel->agent->kernel_dispatch_command_q;
+    }
+
+  return shadow;
+}
+
+/* Do all the work that is necessary before running KERNEL for the first time.
+   The function assumes the program has been created, finalized and frozen by
+   create_and_finalize_hsa_program.  */
+
+static void
+init_kernel (struct kernel_info *kernel)
+{
+  if (pthread_mutex_lock (&kernel->init_mutex))
+    GOMP_PLUGIN_fatal ("Could not lock an HSA kernel initialization mutex");
+  if (kernel->initialized)
+    {
+      if (pthread_mutex_unlock (&kernel->init_mutex))
+	GOMP_PLUGIN_fatal ("Could not unlock an HSA kernel initialization "
+			   "mutex");
+
+      return;
+    }
+
+  /* Precompute the maximum size of OMP data necessary for this kernel and
+     for all kernels it can dispatch.  */
+  init_single_kernel (kernel, &kernel->max_omp_data_size);
+
+  if (!kernel->initialization_failed)
+    HSA_DEBUG ("\n");
+
+  kernel->initialized = true;
+  if (pthread_mutex_unlock (&kernel->init_mutex))
+    GOMP_PLUGIN_fatal ("Could not unlock an HSA kernel initialization "
+		       "mutex");
+}
+
+/* Structure provided by the compiler, specifying the grid sizes.  */
+
+struct kernel_launch_attributes
+{
+  /* Number of dimensions the workload has.  Maximum number is 3.  */
+  uint32_t ndim;
+  /* Size of the grid in the three respective dimensions.  */
+  uint32_t gdims[3];
+  /* Size of work-groups in the respective dimensions.  */
+  uint32_t wdims[3];
+};
+
+/* Parse the launch attributes INPUT provided by the compiler and return true
+   if we should run anything at all.  If INPUT is NULL, fill DEF with default
+   values and store it in *RESULT; otherwise store INPUT there.  */
+
+static bool
+parse_launch_attributes (const void *input,
+			 struct kernel_launch_attributes *def,
+			 const struct kernel_launch_attributes **result)
+{
+  if (!input)
+    {
+      def->ndim = 1;
+      def->gdims[0] = 1;
+      def->gdims[1] = 1;
+      def->gdims[2] = 1;
+      def->wdims[0] = 1;
+      def->wdims[1] = 1;
+      def->wdims[2] = 1;
+      *result = def;
+      HSA_DEBUG ("GOMP_OFFLOAD_run called with no launch attributes\n");
+      return true;
+    }
+
+  const struct kernel_launch_attributes *kla;
+  kla = (const struct kernel_launch_attributes *) input;
+  *result = kla;
+  if (kla->ndim != 1)
+    GOMP_PLUGIN_fatal ("HSA does not yet support number of dimensions "
+		       "different from one.");
+  if (kla->gdims[0] == 0)
+    return false;
+
+  HSA_DEBUG ("GOMP_OFFLOAD_run called with grid size %u and group size %u\n",
+	     kla->gdims[0], kla->wdims[0]);
+
+  return true;
+}
+
+/* Return true if the HSA runtime can run function FN_PTR.  */
+
+bool
+GOMP_OFFLOAD_can_run (void *fn_ptr)
+{
+  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
+  struct agent_info *agent = kernel->agent;
+  create_and_finalize_hsa_program (agent);
+
+  if (agent->prog_finalized_error)
+    goto failure;
+
+  init_kernel (kernel);
+  if (kernel->initialization_failed)
+    goto failure;
+
+  return true;
+
+failure:
+  if (suppress_host_fallback)
+    GOMP_PLUGIN_fatal ("HSA host fallback has been suppressed");
+  HSA_DEBUG ("HSA target cannot be launched, doing a host fallback\n");
+  return false;
+}
+
+/* Part of the libgomp plugin interface.  Run a kernel on device N, passing
+   it an array of pointers in VARS as a parameter.  The kernel is identified
+   by FN_PTR, which must point to a kernel_info structure.  */
+
+void
+GOMP_OFFLOAD_run (int n, void *fn_ptr, void *vars, const void* kern_launch)
+{
+  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
+  struct agent_info *agent = kernel->agent;
+  struct kernel_launch_attributes def;
+  const struct kernel_launch_attributes *kla;
+  if (!parse_launch_attributes (kern_launch, &def, &kla))
+    {
+      HSA_DEBUG ("Will not run HSA kernel because the grid size is zero\n");
+      return;
+    }
+  if (pthread_rwlock_rdlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to read-lock an HSA agent rwlock");
+
+  if (!agent->initialized)
+    GOMP_PLUGIN_fatal ("Agent must be initialized");
+
+  if (!kernel->initialized)
+    GOMP_PLUGIN_fatal ("Called kernel must be initialized");
+
+  struct hsa_kernel_dispatch *shadow = create_kernel_dispatch
+    (kernel, kernel->max_omp_data_size);
+
+  if (debug)
+    {
+      fprintf (stderr, "\nKernel has following dependencies:\n");
+      print_kernel_dispatch (shadow, 2);
+    }
+
+  uint64_t index = hsa_queue_add_write_index_release (agent->command_q, 1);
+  HSA_DEBUG ("Got AQL index %llu\n", (long long int) index);
+
+  /* Wait until the queue is not full before writing the packet.  */
+  while (index - hsa_queue_load_read_index_acquire (agent->command_q)
+	 >= agent->command_q->size)
+    ;
+
+  hsa_kernel_dispatch_packet_t *packet;
+  packet = ((hsa_kernel_dispatch_packet_t*) agent->command_q->base_address)
+    + index % agent->command_q->size;
+
+  memset (((uint8_t *)packet) + 4, 0, sizeof (*packet) - 4);
+  packet->setup  |= (uint16_t) 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  packet->grid_size_x = kla->gdims[0];
+  uint32_t wgs = kla->wdims[0];
+  if (wgs == 0)
+    /* TODO: Provide a default via environment.  */
+    wgs = 64;
+  else if (wgs > kla->gdims[0])
+    wgs = kla->gdims[0];
+  packet->workgroup_size_x = wgs;
+  packet->grid_size_y = 1;
+  packet->workgroup_size_y = 1;
+  packet->grid_size_z = 1;
+  packet->workgroup_size_z = 1;
+  packet->private_segment_size = kernel->private_segment_size;
+  packet->group_segment_size = kernel->group_segment_size;
+  packet->kernel_object = kernel->object;
+  packet->kernarg_address = shadow->kernarg_address;
+  hsa_signal_t s;
+  s.handle = shadow->signal;
+  packet->completion_signal = s;
+  hsa_signal_store_relaxed (s, 1);
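+  /* The kernarg segment holds two pointers: the VARS argument array first,
+     followed by a pointer to the kernel dispatch (shadow) structure.  */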
+  memcpy (shadow->kernarg_address, &vars, sizeof (vars));
+
+  memcpy (shadow->kernarg_address + sizeof (vars), &shadow,
+	  sizeof (struct hsa_kernel_runtime *));
+
+  HSA_DEBUG ("Copying kernel runtime pointer to kernarg_address\n");
+
+  uint16_t header;
+  header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
+  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
+  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
+
+  HSA_DEBUG ("Going to dispatch kernel %s\n", kernel->name);
+
+  __atomic_store_n ((uint16_t*)(&packet->header), header, __ATOMIC_RELEASE);
+  hsa_signal_store_release (agent->command_q->doorbell_signal, index);
+
+  /* TODO: Fix up; the following workaround is necessary to run kernels via
+     the kernel dispatch mechanism on a Carrizo machine.  */
+
+  for (unsigned i = 0; i < shadow->kernel_dispatch_count; i++)
+    {
+      hsa_signal_t child_s;
+      child_s.handle = shadow->children_dispatches[i]->signal;
+
+      HSA_DEBUG ("Waiting for children completion signal: %lu\n",
+		 shadow->children_dispatches[i]->signal);
+      while (hsa_signal_wait_acquire
+	     (child_s, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
+	      HSA_WAIT_STATE_BLOCKED) != 0);
+    }
+
+  HSA_DEBUG ("Kernel dispatched, waiting for completion\n");
+  while (hsa_signal_wait_acquire (s, HSA_SIGNAL_CONDITION_LT, 1,
+				  UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0);
+
+  release_kernel_dispatch (shadow);
+
+  if (pthread_rwlock_unlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to unlock an HSA agent rwlock");
+}
+
+/* Deinitialize all information associated with MODULE and kernels within
+   it.  */
+
+void
+destroy_module (struct module_info *module)
+{
+  int i;
+  for (i = 0; i < module->kernel_count; i++)
+    if (pthread_mutex_destroy (&module->kernels[i].init_mutex))
+      GOMP_PLUGIN_fatal ("Failed to destroy an HSA kernel initialization "
+			 "mutex");
+}
+
+/* Part of the libgomp plugin interface.  Unload BRIG module described by
+   struct brig_image_desc in TARGET_DATA from agent number N.  */
+
+/* FIXME: Like when loading an image, look at the version.  */
+
+void
+GOMP_OFFLOAD_unload_image (int n, unsigned version  __attribute__ ((unused)),
+			   void *target_data)
+{
+  struct agent_info *agent;
+  agent = get_agent_info (n);
+  if (pthread_rwlock_wrlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to write-lock an HSA agent rwlock");
+
+  struct module_info *module = agent->first_module;
+  while (module)
+    {
+      if (module->image_desc == target_data)
+	break;
+      module = module->next;
+    }
+  if (!module)
+    GOMP_PLUGIN_fatal ("Attempt to unload an image that has never been "
+		       "loaded before");
+
+  remove_module_from_agent (agent, module);
+  destroy_module (module);
+  free (module);
+  destroy_hsa_program (agent);
+  if (pthread_rwlock_unlock (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Unable to unlock an HSA agent rwlock");
+}
+
+/* Part of the libgomp plugin interface.  Deinitialize all information and
+   status associated with agent number N.  We do not attempt any
+   synchronization, assuming the user and libgomp will not attempt
+   deinitialization of a device that is in any way being used at the same
+   time.  */
+
+void
+GOMP_OFFLOAD_fini_device (int n)
+{
+  struct agent_info *agent = get_agent_info (n);
+  if (!agent->initialized)
+    return;
+
+  struct module_info *next_module = agent->first_module;
+  while (next_module)
+    {
+      struct module_info *module = next_module;
+      next_module = module->next;
+      destroy_module (module);
+      free (module);
+    }
+  agent->first_module = NULL;
+  destroy_hsa_program (agent);
+
+  release_agent_shared_libraries (agent);
+
+  hsa_status_t status = hsa_queue_destroy (agent->command_q);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error destroying command queue", status);
+  status = hsa_queue_destroy (agent->kernel_dispatch_command_q);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Error destroying kernel dispatch command queue", status);
+  if (pthread_mutex_destroy (&agent->prog_mutex))
+    GOMP_PLUGIN_fatal ("Failed to destroy an HSA agent program mutex");
+  if (pthread_rwlock_destroy (&agent->modules_rwlock))
+    GOMP_PLUGIN_fatal ("Failed to destroy an HSA agent rwlock");
+  agent->initialized = false;
+}
+
+/* Part of the libgomp plugin interface.  Not implemented as it is not required
+   for HSA.  */
+
+void *
+GOMP_OFFLOAD_alloc (int ord, size_t size)
+{
+  GOMP_PLUGIN_fatal ("HSA GOMP_OFFLOAD_alloc is not implemented because "
+		     "it should never be called");
+}
+
+/* Part of the libgomp plugin interface.  Not implemented as it is not required
+   for HSA.  */
+
+void
+GOMP_OFFLOAD_free (int ord, void *ptr)
+{
+  GOMP_PLUGIN_fatal ("HSA GOMP_OFFLOAD_free is not implemented because "
+		     "it should never be called");
+}
+
+/* Part of the libgomp plugin interface.  Not implemented as it is not required
+   for HSA.  */
+
+void *
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
+{
+  GOMP_PLUGIN_fatal ("HSA GOMP_OFFLOAD_dev2host is not implemented because "
+		     "it should never be called");
+}
+
+/* Part of the libgomp plugin interface.  Not implemented as it is not required
+   for HSA.  */
+
+void *
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
+{
+  GOMP_PLUGIN_fatal ("HSA GOMP_OFFLOAD_host2dev is not implemented because "
+		     "it should never be called");
+}
+
+/* Part of the libgomp plugin interface.  Not implemented as it is not required
+   for HSA.  */
+
+void *
+GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n)
+{
+  GOMP_PLUGIN_fatal ("HSA GOMP_OFFLOAD_dev2dev is not implemented because "
+		     "it should never be called");
+}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 4/12] OpenMP lowering/expansion changes (gridification)
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (2 preceding siblings ...)
  2015-11-05 21:56 ` [hsa 3/12] HSA libgomp plugin Martin Jambor
@ 2015-11-05 21:57 ` Martin Jambor
  2015-11-09 10:02   ` Martin Jambor
  2015-11-12 11:16   ` Jakub Jelinek
  2015-11-05 21:58 ` [hsa 5/12] New HSA-related GCC options Martin Jambor
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:57 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek

Hi,

the patch in this email contains the changes to make our OpenMP
lowering and expansion machinery produce GPU kernels for a certain
limited class of loops.  The plan is to make that class quite a bit
bigger, but only the following is ready for submission now.

Basically, whenever the compiler configured for HSAIL generation
encounters the following pattern:

  #pragma omp target
  #pragma omp teams thread_limit(workgroup_size) // thread_limit is optional
  #pragma omp distribute parallel for firstprivate(n) private(i) other_sharing_clauses()
    for (i = 0; i < n; i++)
      some_loop_body

it creates a copy of the entire target body and expands it slightly
differently for concurrent execution on a GPU.  Note that both teams
and distribute constructs are mandatory.  Moreover, currently the
distribute has to be in a combined statement with the inner for
construct.  And there are quite a few other restrictions which I hope
to alleviate over the next year, most notably by implementing reductions.  A
few days ago I hoped to finish writing support for collapse(2) and
collapse(3) clauses in time for stage1 but now I am a bit sceptical.

The first phase of the "gridification" process is run before the omp
"scanning" phase.  We look for the pattern above, and if we encounter
one, we copy its entire body into a new gimple statement
GIMPLE_OMP_GPUKERNEL.  Within it, we mark the teams, distribute and
parallel constructs with a new flag "kernel_phony."  This flag will
then make the OMP lowering phase process their sharing clauses as
usual, but the statements representing the constructs will be removed
at lowering (and thus will never be expanded).  The resulting wasteful
repackaging of data is nicely cleaned up by our optimizers even at -O1.

At expansion time, we identify gomp_target statements with a kernel
and expand the kernel into a special function, with the loop
represented by the GPU grid and not control flow.  Afterwards, the
normal body of the target is expanded as usual.  Finally, we need to
take the grid dimensions stored within new fields of the target
statement by the first phase, store them in a structure and pass them
to libgomp in a new parameter of GOMP_target_41.

Originally, when I started with the above pattern matching, I did not
allow any other gimple statements in between the respective omp
constructs.  That, however, proved to be too restrictive for two
reasons.  First, statements in the pre-bodies of both distribute and
for loops needed to be accounted for when calculating the kernel grid
size (which is done before the target statement itself), and second,
Fortran parameter dereferences happily result in interleaving
statements when there were none in the user source code.

Therefore, I allow register-type stores to local non-addressable
variables in pre-bodies and also in between the OMP constructs.  All
of them are copied in front of the target statement and either used
for grid size calculation or removed as useless by later
optimizations.

For the convenience of anybody reviewing the code, I'm attaching a
very simple testcase with a selection of dumps that illustrate the
whole process.

While we have also been experimenting quite a bit with dynamic
parallelism, we have only been able to achieve any good performance
via this process of gridification.  The user can be notified whether a
particular target construct was gridified or not via our process of
dumping notes, which however only appear in the detailed dump.  I am
seriously considering emitting some kind of warning when an
HSA-enabled compiler is about to produce non-gridified target code.

I hope that I have managed to write the gridification in a way that
interferes very little with the rest of the OMP pipeline and yet
re-implements only the bare minimum of functionality that is already
there.  I'll be grateful for any feedback regarding the approach.

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>

	* builtin-types.def (BT_FN_VOID_PTR_INT_PTR): New.
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR): Removed.
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR): New.
	* fortran/types.def (BT_FN_VOID_PTR_INT_PTR): New.
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR): Removed.
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR): New.
	* gimple-low.c (lower_stmt): Handle GIMPLE_OMP_GPUKERNEL.
	* gimple-pretty-print.c (dump_gimple_omp_for): Likewise.
	(dump_gimple_omp_block): Handle GF_OMP_FOR_KIND_KERNEL_BODY
	(pp_gimple_stmt_1): Handle GIMPLE_OMP_GPUKERNEL.
	* gimple-walk.c (walk_gimple_stmt): Likewise.
	* gimple.c (gimple_build_omp_gpukernel): New function.
	(gimple_omp_target_init_dimensions): Likewise.
	(gimple_copy): Handle GIMPLE_OMP_GPUKERNEL.
	* gimple.def (GIMPLE_OMP_TEAMS): Moved into its own layout.
	(GIMPLE_OMP_GPUKERNEL): New.
	* gimple.h (gf_mask): New element GF_OMP_FOR_KIND_KERNEL_BODY.
	(gomp_for): New field kernel_phony.
	(gimple_omp_target_grid_dim): New type.
	(gimple_statement_omp_parallel_layout): New fields dimensions,
	kernel_dim, kernel_phony.
	(gomp_teams): New field kernel_phony.
	(gimple_build_omp_gpukernel): Declare.
	(gimple_omp_target_init_dimensions): Likewise.
	(gimple_has_substatements): Handle GIMPLE_OMP_GPUKERNEL.
	(gimple_omp_for_kernel_phony): New function.
	(gimple_omp_for_set_kernel_phony): Likewise.
	(gimple_omp_parallel_kernel_phony): Likewise.
	(gimple_omp_parallel_set_kernel_phony): Likewise.
	(gimple_omp_target_dimensions): Likewise.
	(gimple_omp_target_grid_size): Likewise.
	(gimple_omp_target_grid_size_ptr): Likewise.
	(gimple_omp_target_set_grid_size): Likewise.
	(gimple_omp_target_workgroup_size): Likewise.
	(gimple_omp_target_workgroup_size_ptr): Likewise.
	(gimple_omp_target_set_workgroup_size): Likewise.
	(gimple_omp_teams_kernel_phony): Likewise.
	(gimple_omp_teams_set_kernel_phony): Likewise.
	(CASE_GIMPLE_OMP): Handle GIMPLE_OMP_GPUKERNEL.
	* gsstruct.def (GSS_OMP_TEAMS_LAYOUT): New.
	* omp-builtins.def (BUILT_IN_GOMP_OFFLOAD_REGISTER): Likewise.
	(BUILT_IN_GOMP_OFFLOAD_UNREGISTER): Likewise.
	(BUILT_IN_GOMP_TARGET): Changed type.
	* omp-low.c: Include symbol-summary.h and hsa.h.
	(adjust_for_condition): New function.
	(get_omp_for_step_from_incr): Likewise.
	(extract_omp_for_data): Moved parts to adjust_for_condition and
	get_omp_for_step_from_incr.
	(build_outer_var_ref): Handle GIMPLE_OMP_GPUKERNEL.
	(fixup_child_record_type): Bail out if receiver_decl is NULL.
	(scan_omp_parallel): Do not create child functions for phony
	constructs.
	(scan_omp_target): Scan target dimensions.
	(check_omp_nesting_restrictions): Handle GIMPLE_OMP_GPUKERNEL.
	(scan_omp_1_stmt): Likewise.
	(region_needs_kernel_p): New function.
	(expand_parallel_call): Register appropriate parallel child
	functions as HSA kernels.
	(kernel_dim_array_type, kernel_lattrs_dimnum_decl): New variables.
	(kernel_lattrs_grid_decl, kernel_lattrs_group_decl): Likewise.
	(kernel_launch_attributes_type): Likewise.
	(create_kernel_launch_attr_types): New function.
	(insert_store_range_dim): Likewise.
	(get_kernel_launch_attributes): Likewise.
	(expand_omp_target): Fill in kernel dimensions, if any.
	(expand_omp_for_kernel): New function.
	(arg_decl_map): New type.
	(remap_kernel_arg_accesses): New function.
	(expand_omp): New forward declaration.
	(expand_target_kernel_body): New function.
	(expand_omp): Call it.
	(lower_omp_for): Do not emit phony constructs.
	(lower_omp_taskreg): Do not emit phony constructs but create for them
	a temporary variable receiver_decl.
	(lower_omp_teams): Do not emit phony constructs.
	(lower_omp_gpukernel): New function.
	(lower_omp_1): Call it.
	(reg_assignment_to_local_var_p): New function.
	(seq_only_contains_local_assignments): Likewise.
	(find_single_omp_among_assignments_1): Likewise.
	(find_single_omp_among_assignments): Likewise.
	(find_ungridifiable_statement): Likewise.
	(target_follows_gridifiable_pattern): Likewise.
	(remap_prebody_decls): Likewise.
	(copy_leading_local_assignments): Likewise.
	(process_kernel_body_copy): Likewise.
	(attempt_target_gridification): Likewise.
	(create_target_gpukernel_stmt): Likewise.
	(create_target_gpukernels): Likewise.
	(execute_lower_omp): Call create_target_gpukernels.
	(make_gimple_omp_edges): Handle GIMPLE_OMP_GPUKERNEL.

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index b561436..e2fa418 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -450,6 +450,7 @@ DEF_FUNCTION_TYPE_3 (BT_FN_BOOL_ULONG_ULONG_ULONGPTR, BT_BOOL, BT_ULONG,
 		     BT_ULONG, BT_PTR_ULONG)
 DEF_FUNCTION_TYPE_3 (BT_FN_BOOL_ULONGLONG_ULONGLONG_ULONGLONGPTR, BT_BOOL,
 		     BT_ULONGLONG, BT_ULONGLONG, BT_PTR_ULONGLONG)
+DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_INT_PTR, BT_VOID, BT_PTR, BT_INT, BT_PTR)
 
 DEF_FUNCTION_TYPE_4 (BT_FN_SIZE_CONST_PTR_SIZE_SIZE_FILEPTR,
 		     BT_SIZE, BT_CONST_PTR, BT_SIZE, BT_SIZE, BT_FILEPTR)
@@ -547,13 +548,13 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,
 		     BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR, BT_PTR, BT_UINT,
 		     BT_PTR)
 
-DEF_FUNCTION_TYPE_8 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
-		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
-		     BT_PTR, BT_PTR, BT_UINT, BT_PTR)
 DEF_FUNCTION_TYPE_8 (BT_FN_VOID_OMPFN_PTR_UINT_LONG_LONG_LONG_LONG_UINT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT,
 		     BT_LONG, BT_LONG, BT_LONG, BT_LONG, BT_UINT)
 
+DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
+		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
+		     BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)
 DEF_FUNCTION_TYPE_9 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_BOOL_UINT_PTR_INT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
 		     BT_PTR_FN_VOID_PTR_PTR, BT_LONG, BT_LONG,
diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def
index ca75654..a9cfc84 100644
--- a/gcc/fortran/types.def
+++ b/gcc/fortran/types.def
@@ -145,6 +145,7 @@ DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I2_INT, BT_VOID, BT_VOLATILE_PTR, BT_I2, BT
 DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I4_INT, BT_VOID, BT_VOLATILE_PTR, BT_I4, BT_INT)
 DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I8_INT, BT_VOID, BT_VOLATILE_PTR, BT_I8, BT_INT)
 DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I16_INT, BT_VOID, BT_VOLATILE_PTR, BT_I16, BT_INT)
+DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_INT_PTR, BT_VOID, BT_PTR, BT_INT, BT_PTR)
 
 DEF_FUNCTION_TYPE_4 (BT_FN_VOID_OMPFN_PTR_UINT_UINT,
                      BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, BT_UINT)
@@ -215,9 +216,9 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,
 DEF_FUNCTION_TYPE_8 (BT_FN_VOID_OMPFN_PTR_UINT_LONG_LONG_LONG_LONG_UINT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT,
 		     BT_LONG, BT_LONG, BT_LONG, BT_LONG, BT_UINT)
-DEF_FUNCTION_TYPE_8 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
+DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
 		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
-		     BT_PTR, BT_PTR, BT_UINT, BT_PTR)
+		     BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)
 
 DEF_FUNCTION_TYPE_9 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_BOOL_UINT_PTR_INT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
diff --git a/gcc/gimple-low.c b/gcc/gimple-low.c
index 4994918..d2a6a80 100644
--- a/gcc/gimple-low.c
+++ b/gcc/gimple-low.c
@@ -358,6 +358,7 @@ lower_stmt (gimple_stmt_iterator *gsi, struct lower_data *data)
     case GIMPLE_OMP_TASK:
     case GIMPLE_OMP_TARGET:
     case GIMPLE_OMP_TEAMS:
+    case GIMPLE_OMP_GPUKERNEL:
       data->cannot_fallthru = false;
       lower_omp_directive (gsi, data);
       data->cannot_fallthru = false;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 7b50cdf..83498bc 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -1187,6 +1187,9 @@ dump_gimple_omp_for (pretty_printer *buffer, gomp_for *gs, int spc, int flags)
 	case GF_OMP_FOR_KIND_CILKSIMD:
 	  pp_string (buffer, "#pragma simd");
 	  break;
+	case GF_OMP_FOR_KIND_KERNEL_BODY:
+	  pp_string (buffer, "#pragma omp for kernel");
+	  break;
 	default:
 	  gcc_unreachable ();
 	}
@@ -1488,6 +1491,9 @@ dump_gimple_omp_block (pretty_printer *buffer, gimple *gs, int spc, int flags)
 	case GIMPLE_OMP_SECTION:
 	  pp_string (buffer, "#pragma omp section");
 	  break;
+	case GIMPLE_OMP_GPUKERNEL:
+	  pp_string (buffer, "#pragma omp gpukernel");
+	  break;
 	default:
 	  gcc_unreachable ();
 	}
@@ -2270,6 +2276,7 @@ pp_gimple_stmt_1 (pretty_printer *buffer, gimple *gs, int spc, int flags)
     case GIMPLE_OMP_MASTER:
     case GIMPLE_OMP_TASKGROUP:
     case GIMPLE_OMP_SECTION:
+    case GIMPLE_OMP_GPUKERNEL:
       dump_gimple_omp_block (buffer, gs, spc, flags);
       break;
 
diff --git a/gcc/gimple-walk.c b/gcc/gimple-walk.c
index 850cf57..695592d 100644
--- a/gcc/gimple-walk.c
+++ b/gcc/gimple-walk.c
@@ -644,6 +644,7 @@ walk_gimple_stmt (gimple_stmt_iterator *gsi, walk_stmt_fn callback_stmt,
     case GIMPLE_OMP_SINGLE:
     case GIMPLE_OMP_TARGET:
     case GIMPLE_OMP_TEAMS:
+    case GIMPLE_OMP_GPUKERNEL:
       ret = walk_gimple_seq_mod (gimple_omp_body_ptr (stmt), callback_stmt,
 			     callback_op, wi);
       if (ret)
diff --git a/gcc/gimple.c b/gcc/gimple.c
index 4ce38da..9eba126 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -953,6 +953,19 @@ gimple_build_omp_master (gimple_seq body)
   return p;
 }
 
+/* Build a GIMPLE_OMP_GPUKERNEL statement.
+
+   BODY is the sequence of statements to be executed by the kernel.  */
+
+gimple *
+gimple_build_omp_gpukernel (gimple_seq body)
+{
+  gimple *p = gimple_alloc (GIMPLE_OMP_GPUKERNEL, 0);
+  if (body)
+    gimple_omp_set_body (p, body);
+
+  return p;
+}
 
 /* Build a GIMPLE_OMP_TASKGROUP statement.
 
@@ -1084,6 +1097,16 @@ gimple_build_omp_target (gimple_seq body, int kind, tree clauses)
   return p;
 }
 
+/* Set dimensions of TARGET to NUM and allocate kernel_dim array of the
+   statement with the appropriate number of elements.  */
+
+void
+gimple_omp_target_init_dimensions (gomp_target *target, size_t num)
+{
+  gcc_assert (num > 0);
+  target->dimensions = num;
+  target->kernel_dim = ggc_cleared_vec_alloc<gimple_omp_target_grid_dim> (num);
+}
 
 /* Build a GIMPLE_OMP_TEAMS statement.
 
@@ -1804,6 +1827,7 @@ gimple_copy (gimple *stmt)
 	case GIMPLE_OMP_SECTION:
 	case GIMPLE_OMP_MASTER:
 	case GIMPLE_OMP_TASKGROUP:
+	case GIMPLE_OMP_GPUKERNEL:
 	copy_omp_body:
 	  new_seq = gimple_seq_copy (gimple_omp_body (stmt));
 	  gimple_omp_set_body (copy, new_seq);
diff --git a/gcc/gimple.def b/gcc/gimple.def
index d3ca402..30f0111 100644
--- a/gcc/gimple.def
+++ b/gcc/gimple.def
@@ -369,13 +369,17 @@ DEFGSCODE(GIMPLE_OMP_TARGET, "gimple_omp_target", GSS_OMP_PARALLEL_LAYOUT)
 /* GIMPLE_OMP_TEAMS <BODY, CLAUSES> represents #pragma omp teams
    BODY is the sequence of statements inside the single section.
    CLAUSES is an OMP_CLAUSE chain holding the associated clauses.  */
-DEFGSCODE(GIMPLE_OMP_TEAMS, "gimple_omp_teams", GSS_OMP_SINGLE_LAYOUT)
+DEFGSCODE(GIMPLE_OMP_TEAMS, "gimple_omp_teams", GSS_OMP_TEAMS_LAYOUT)
 
 /* GIMPLE_OMP_ORDERED <BODY, CLAUSES> represents #pragma omp ordered.
    BODY is the sequence of statements to execute in the ordered section.
    CLAUSES is an OMP_CLAUSE chain holding the associated clauses.  */
 DEFGSCODE(GIMPLE_OMP_ORDERED, "gimple_omp_ordered", GSS_OMP_SINGLE_LAYOUT)
 
+/* GIMPLE_OMP_GPUKERNEL <BODY> represents a parallel loop lowered for execution
+   on a GPU.  It is an artificial statement created by omp lowering.  */
+DEFGSCODE(GIMPLE_OMP_GPUKERNEL, "gimple_omp_gpukernel", GSS_OMP)
+
 /* GIMPLE_PREDICT <PREDICT, OUTCOME> specifies a hint for branch prediction.
 
    PREDICT is one of the predictors from predict.def.
diff --git a/gcc/gimple.h b/gcc/gimple.h
index 781801b..a32d83c 100644
--- a/gcc/gimple.h
+++ b/gcc/gimple.h
@@ -153,6 +153,7 @@ enum gf_mask {
     GF_OMP_FOR_KIND_TASKLOOP	= 2,
     GF_OMP_FOR_KIND_CILKFOR     = 3,
     GF_OMP_FOR_KIND_OACC_LOOP	= 4,
+    GF_OMP_FOR_KIND_KERNEL_BODY = 5,
     /* Flag for SIMD variants of OMP_FOR kinds.  */
     GF_OMP_FOR_SIMD		= 1 << 3,
     GF_OMP_FOR_KIND_SIMD	= GF_OMP_FOR_SIMD | 0,
@@ -621,8 +622,24 @@ struct GTY((tag("GSS_OMP_FOR")))
   /* [ WORD 11 ]
      Pre-body evaluated before the loop body begins.  */
   gimple_seq pre_body;
+
+  /* [ WORD 12 ]
+     If set, this statement is part of a gridified kernel; its clauses need to
+     be scanned and lowered, but the statement should be discarded after
+     lowering.  */
+  bool kernel_phony;
 };
 
+/* Descriptor of one dimension of a kernel grid.  */
+
+struct GTY(()) gimple_omp_target_grid_dim
+{
+  /* Size of the whole grid in the respective dimension.  */
+  tree grid_size;
+
+  /* Size of the workgroup in the respective dimension.  */
+  tree workgroup_size;
+};
 
 /* GIMPLE_OMP_PARALLEL, GIMPLE_OMP_TARGET, GIMPLE_OMP_TASK */
 
@@ -642,6 +659,26 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT")))
   /* [ WORD 10 ]
      Shared data argument.  */
   tree data_arg;
+
+  /* TODO: Revisit the placement of the following two fields.  On one hand,
+     we currently only use them on the target construct.  On the other, use
+     on the parallel construct is also possible in the future.  */
+
+  /* [ WORD 11 ] */
+  /* Number of elements in the kernel_dim array.  */
+  size_t dimensions;
+
+  /* [ WORD 12 ] */
+  /* If the target also contains a GPU kernel, it should be run with the
+     following grid sizes.  */
+  struct gimple_omp_target_grid_dim
+    * GTY((length ("%h.dimensions"))) kernel_dim;
+
+  /* [ WORD 13 ] */
+  /* If set, this statement is part of a gridified kernel; its clauses need to
+     be scanned and lowered, but the statement should be discarded after
+     lowering.  */
+  bool kernel_phony;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -724,14 +761,14 @@ struct GTY((tag("GSS_OMP_CONTINUE")))
   tree control_use;
 };
 
-/* GIMPLE_OMP_SINGLE, GIMPLE_OMP_TEAMS, GIMPLE_OMP_ORDERED */
+/* GIMPLE_OMP_SINGLE, GIMPLE_OMP_ORDERED */
 
 struct GTY((tag("GSS_OMP_SINGLE_LAYOUT")))
   gimple_statement_omp_single_layout : public gimple_statement_omp
 {
   /* [ WORD 1-7 ] : base class */
 
-  /* [ WORD 7 ]  */
+  /* [ WORD 8 ]  */
   tree clauses;
 };
 
@@ -742,11 +779,18 @@ struct GTY((tag("GSS_OMP_SINGLE_LAYOUT")))
          stmt->code == GIMPLE_OMP_SINGLE.  */
 };
 
-struct GTY((tag("GSS_OMP_SINGLE_LAYOUT")))
+/* GIMPLE_OMP_TEAMS */
+
+struct GTY((tag("GSS_OMP_TEAMS_LAYOUT")))
   gomp_teams : public gimple_statement_omp_single_layout
 {
-    /* No extra fields; adds invariant:
-         stmt->code == GIMPLE_OMP_TEAMS.  */
+  /* [ WORD 1-8 ] : base class */
+
+  /* [ WORD 9 ]
+     If set, this statement is part of a gridified kernel; its clauses need to
+     be scanned and lowered, but the statement should be discarded after
+     lowering.  */
+  bool kernel_phony;
 };
 
 struct GTY((tag("GSS_OMP_SINGLE_LAYOUT")))
@@ -1450,6 +1494,7 @@ gomp_task *gimple_build_omp_task (gimple_seq, tree, tree, tree, tree,
 				       tree, tree);
 gimple *gimple_build_omp_section (gimple_seq);
 gimple *gimple_build_omp_master (gimple_seq);
+gimple *gimple_build_omp_gpukernel (gimple_seq);
 gimple *gimple_build_omp_taskgroup (gimple_seq);
 gomp_continue *gimple_build_omp_continue (tree, tree);
 gomp_ordered *gimple_build_omp_ordered (gimple_seq, tree);
@@ -1458,6 +1503,7 @@ gomp_sections *gimple_build_omp_sections (gimple_seq, tree);
 gimple *gimple_build_omp_sections_switch (void);
 gomp_single *gimple_build_omp_single (gimple_seq, tree);
 gomp_target *gimple_build_omp_target (gimple_seq, int, tree);
+void gimple_omp_target_init_dimensions (gomp_target *, size_t);
 gomp_teams *gimple_build_omp_teams (gimple_seq, tree);
 gomp_atomic_load *gimple_build_omp_atomic_load (tree, tree);
 gomp_atomic_store *gimple_build_omp_atomic_store (tree);
@@ -1708,6 +1754,7 @@ gimple_has_substatements (gimple *g)
     case GIMPLE_OMP_CRITICAL:
     case GIMPLE_WITH_CLEANUP_EXPR:
     case GIMPLE_TRANSACTION:
+    case GIMPLE_OMP_GPUKERNEL:
       return true;
 
     default:
@@ -5077,6 +5124,21 @@ gimple_omp_for_set_pre_body (gimple *gs, gimple_seq pre_body)
   omp_for_stmt->pre_body = pre_body;
 }
 
+/* Return the kernel_phony flag of an OMP_FOR statement.  */
+
+static inline bool
+gimple_omp_for_kernel_phony (const gomp_for *omp_for)
+{
+  return omp_for->kernel_phony;
+}
+
+/* Set kernel_phony flag of OMP_FOR to VALUE.  */
+
+static inline void
+gimple_omp_for_set_kernel_phony (gomp_for *omp_for, bool value)
+{
+  omp_for->kernel_phony = value;
+}
 
 /* Return the clauses associated with OMP_PARALLEL GS.  */
 
@@ -5163,6 +5225,22 @@ gimple_omp_parallel_set_data_arg (gomp_parallel *omp_parallel_stmt,
   omp_parallel_stmt->data_arg = data_arg;
 }
 
+/* Return the kernel_phony flag of OMP_PARALLEL_STMT.  */
+
+static inline bool
+gimple_omp_parallel_kernel_phony (const gomp_parallel *omp_parallel_stmt)
+{
+  return omp_parallel_stmt->kernel_phony;
+}
+
+/* Set kernel_phony flag of OMP_PARALLEL_STMT to VALUE.  */
+
+static inline void
+gimple_omp_parallel_set_kernel_phony (gomp_parallel *omp_parallel_stmt,
+				      bool value)
+{
+  omp_parallel_stmt->kernel_phony = value;
+}
 
 /* Return the clauses associated with OMP_TASK GS.  */
 
@@ -5607,6 +5685,72 @@ gimple_omp_target_set_data_arg (gomp_target *omp_target_stmt,
   omp_target_stmt->data_arg = data_arg;
 }
 
+/* Return the number of dimensions of the kernel grid of OMP_TARGET_STMT.  */
+
+static inline size_t
+gimple_omp_target_dimensions (gomp_target *omp_target_stmt)
+{
+  return omp_target_stmt->dimensions;
+}
+
+/* Return the size of kernel grid of OMP_TARGET_STMT along dimension N.  */
+
+static inline tree
+gimple_omp_target_grid_size (gomp_target *omp_target_stmt, unsigned n)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  return omp_target_stmt->kernel_dim[n].grid_size;
+}
+
+/* Return pointer to tree specifying the size of kernel grid of OMP_TARGET_STMT
+   along dimension N.  */
+
+static inline tree *
+gimple_omp_target_grid_size_ptr (gomp_target *omp_target_stmt, unsigned n)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  return &omp_target_stmt->kernel_dim[n].grid_size;
+}
+
+/* Set the size of kernel grid of OMP_TARGET_STMT along dimension N to V.  */
+
+static inline void
+gimple_omp_target_set_grid_size (gomp_target *omp_target_stmt, unsigned n,
+				 tree v)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  omp_target_stmt->kernel_dim[n].grid_size = v;
+}
+
+/* Return the size of kernel work group of OMP_TARGET_STMT along dimension N.  */
+
+static inline tree
+gimple_omp_target_workgroup_size (gomp_target *omp_target_stmt, unsigned n)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  return omp_target_stmt->kernel_dim[n].workgroup_size;
+}
+
+/* Return pointer to tree specifying the size of kernel work group of
+   OMP_TARGET_STMT along dimension N.  */
+
+static inline tree *
+gimple_omp_target_workgroup_size_ptr (gomp_target *omp_target_stmt, unsigned n)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  return &omp_target_stmt->kernel_dim[n].workgroup_size;
+}
+
+/* Set the size of kernel workgroup of OMP_TARGET_STMT along dimension N
+   to V.  */
+
+static inline void
+gimple_omp_target_set_workgroup_size (gomp_target *omp_target_stmt, unsigned n,
+				      tree v)
+{
+  gcc_assert (gimple_omp_target_dimensions (omp_target_stmt) > n);
+  omp_target_stmt->kernel_dim[n].workgroup_size = v;
+}
 
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
@@ -5636,6 +5780,21 @@ gimple_omp_teams_set_clauses (gomp_teams *omp_teams_stmt, tree clauses)
   omp_teams_stmt->clauses = clauses;
 }
 
+/* Return the kernel_phony flag of an OMP_TEAMS_STMT.  */
+
+static inline bool
+gimple_omp_teams_kernel_phony (const gomp_teams *omp_teams_stmt)
+{
+  return omp_teams_stmt->kernel_phony;
+}
+
+/* Set kernel_phony flag of an OMP_TEAMS_STMT to VALUE.  */
+
+static inline void
+gimple_omp_teams_set_kernel_phony (gomp_teams *omp_teams_stmt, bool value)
+{
+  omp_teams_stmt->kernel_phony = value;
+}
 
 /* Return the clauses associated with OMP_SECTIONS GS.  */
 
@@ -5965,7 +6124,8 @@ gimple_return_set_retbnd (gimple *gs, tree retval)
     case GIMPLE_OMP_RETURN:			\
     case GIMPLE_OMP_ATOMIC_LOAD:		\
     case GIMPLE_OMP_ATOMIC_STORE:		\
-    case GIMPLE_OMP_CONTINUE
+    case GIMPLE_OMP_CONTINUE:			\
+    case GIMPLE_OMP_GPUKERNEL
 
 static inline bool
 is_gimple_omp (const gimple *stmt)
diff --git a/gcc/gsstruct.def b/gcc/gsstruct.def
index d84e098..9d6b0ef 100644
--- a/gcc/gsstruct.def
+++ b/gcc/gsstruct.def
@@ -47,6 +47,7 @@ DEFGSSTRUCT(GSS_OMP_PARALLEL_LAYOUT, gimple_statement_omp_parallel_layout, false
 DEFGSSTRUCT(GSS_OMP_TASK, gomp_task, false)
 DEFGSSTRUCT(GSS_OMP_SECTIONS, gomp_sections, false)
 DEFGSSTRUCT(GSS_OMP_SINGLE_LAYOUT, gimple_statement_omp_single_layout, false)
+DEFGSSTRUCT(GSS_OMP_TEAMS_LAYOUT, gomp_teams, false)
 DEFGSSTRUCT(GSS_OMP_CONTINUE, gomp_continue, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_LOAD, gomp_atomic_load, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_STORE_LAYOUT, gomp_atomic_store, false)
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index ea9cf0d..59c677b 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -302,8 +302,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
 		  BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
 		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register",
+		  BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST)
+DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER, "GOMP_offload_unregister",
+		  BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET, "GOMP_target_41",
-		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
+		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
 		  ATTR_NOTHROW_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET_DATA, "GOMP_target_data_41",
 		  BT_FN_VOID_INT_SIZE_PTR_PTR_PTR, ATTR_NOTHROW_LIST)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index d0264e9..379535c 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -80,6 +80,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "lto-section-names.h"
 #include "gomp-constants.h"
 #include "gimple-pretty-print.h"
+#include "symbol-summary.h"
+#include "hsa.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -510,6 +512,63 @@ is_combined_parallel (struct omp_region *region)
   return region->is_combined_parallel;
 }
 
+/* Adjust *COND_CODE and *N2 so that the former is either LT_EXPR or GT_EXPR,
+   unless it is NE_EXPR, in which case it is kept as is.  */
+
+static void
+adjust_for_condition (location_t loc, enum tree_code *cond_code, tree *n2)
+{
+  switch (*cond_code)
+    {
+    case LT_EXPR:
+    case GT_EXPR:
+    case NE_EXPR:
+      break;
+    case LE_EXPR:
+      if (POINTER_TYPE_P (TREE_TYPE (*n2)))
+	*n2 = fold_build_pointer_plus_hwi_loc (loc, *n2, 1);
+      else
+	*n2 = fold_build2_loc (loc, PLUS_EXPR, TREE_TYPE (*n2), *n2,
+			       build_int_cst (TREE_TYPE (*n2), 1));
+      *cond_code = LT_EXPR;
+      break;
+    case GE_EXPR:
+      if (POINTER_TYPE_P (TREE_TYPE (*n2)))
+	*n2 = fold_build_pointer_plus_hwi_loc (loc, *n2, -1);
+      else
+	*n2 = fold_build2_loc (loc, MINUS_EXPR, TREE_TYPE (*n2), *n2,
+			       build_int_cst (TREE_TYPE (*n2), 1));
+      *cond_code = GT_EXPR;
+      break;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Return the looping step from INCR, extracted from the step of a gimple omp
+   for statement.  */
+
+static tree
+get_omp_for_step_from_incr (location_t loc, tree incr)
+{
+  tree step;
+  switch (TREE_CODE (incr))
+    {
+    case PLUS_EXPR:
+      step = TREE_OPERAND (incr, 1);
+      break;
+    case POINTER_PLUS_EXPR:
+      step = fold_convert (ssizetype, TREE_OPERAND (incr, 1));
+      break;
+    case MINUS_EXPR:
+      step = TREE_OPERAND (incr, 1);
+      step = fold_build1_loc (loc, NEGATE_EXPR, TREE_TYPE (step), step);
+      break;
+    default:
+      gcc_unreachable ();
+    }
+  return step;
+}
 
 /* Extract the header elements of parallel loop FOR_STMT and store
    them into *FD.  */
@@ -634,58 +693,14 @@ extract_omp_for_data (gomp_for *for_stmt, struct omp_for_data *fd,
 
       loop->cond_code = gimple_omp_for_cond (for_stmt, i);
       loop->n2 = gimple_omp_for_final (for_stmt, i);
-      switch (loop->cond_code)
-	{
-	case LT_EXPR:
-	case GT_EXPR:
-	  break;
-	case NE_EXPR:
-	  gcc_assert (gimple_omp_for_kind (for_stmt)
-		      == GF_OMP_FOR_KIND_CILKSIMD
-		      || (gimple_omp_for_kind (for_stmt)
-			  == GF_OMP_FOR_KIND_CILKFOR));
-	  break;
-	case LE_EXPR:
-	  if (POINTER_TYPE_P (TREE_TYPE (loop->n2)))
-	    loop->n2 = fold_build_pointer_plus_hwi_loc (loc, loop->n2, 1);
-	  else
-	    loop->n2 = fold_build2_loc (loc,
-				    PLUS_EXPR, TREE_TYPE (loop->n2), loop->n2,
-				    build_int_cst (TREE_TYPE (loop->n2), 1));
-	  loop->cond_code = LT_EXPR;
-	  break;
-	case GE_EXPR:
-	  if (POINTER_TYPE_P (TREE_TYPE (loop->n2)))
-	    loop->n2 = fold_build_pointer_plus_hwi_loc (loc, loop->n2, -1);
-	  else
-	    loop->n2 = fold_build2_loc (loc,
-				    MINUS_EXPR, TREE_TYPE (loop->n2), loop->n2,
-				    build_int_cst (TREE_TYPE (loop->n2), 1));
-	  loop->cond_code = GT_EXPR;
-	  break;
-	default:
-	  gcc_unreachable ();
-	}
+      gcc_assert (loop->cond_code != NE_EXPR
+		  || gimple_omp_for_kind (for_stmt) == GF_OMP_FOR_KIND_CILKSIMD
+		  || gimple_omp_for_kind (for_stmt) == GF_OMP_FOR_KIND_CILKFOR);
+      adjust_for_condition (loc, &loop->cond_code, &loop->n2);
 
       t = gimple_omp_for_incr (for_stmt, i);
       gcc_assert (TREE_OPERAND (t, 0) == var);
-      switch (TREE_CODE (t))
-	{
-	case PLUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
-	  break;
-	case POINTER_PLUS_EXPR:
-	  loop->step = fold_convert (ssizetype, TREE_OPERAND (t, 1));
-	  break;
-	case MINUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
-	  loop->step = fold_build1_loc (loc,
-				    NEGATE_EXPR, TREE_TYPE (loop->step),
-				    loop->step);
-	  break;
-	default:
-	  gcc_unreachable ();
-	}
+      loop->step = get_omp_for_step_from_incr (loc, t);
 
       if (simd
 	  || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
@@ -1389,7 +1404,16 @@ build_outer_var_ref (tree var, omp_context *ctx, bool lastprivate = false)
 	}
     }
   else if (ctx->outer)
-    x = lookup_decl (var, ctx->outer);
+    {
+      omp_context *outer = ctx->outer;
+      if (gimple_code (outer->stmt) == GIMPLE_OMP_GPUKERNEL)
+	{
+	  outer = outer->outer;
+	  gcc_assert (outer
+		      && gimple_code (outer->stmt) != GIMPLE_OMP_GPUKERNEL);
+	}
+      x = lookup_decl (var, outer);
+    }
   else if (is_reference (var))
     /* This can happen with orphaned constructs.  If var is reference, it is
        possible it is shared and as such valid.  */
@@ -1837,6 +1861,8 @@ fixup_child_record_type (omp_context *ctx)
 {
   tree f, type = ctx->record_type;
 
+  if (!ctx->receiver_decl)
+    return;
   /* ??? It isn't sufficient to just call remap_type here, because
      variably_modified_type_p doesn't work the way we expect for
      record types.  Testing each field for whether it needs remapping
@@ -2730,8 +2756,11 @@ scan_omp_parallel (gimple_stmt_iterator *gsi, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
-  create_omp_child_function (ctx, false);
-  gimple_omp_parallel_set_child_fn (stmt, ctx->cb.dst_fn);
+  if (!gimple_omp_parallel_kernel_phony (stmt))
+    {
+      create_omp_child_function (ctx, false);
+      gimple_omp_parallel_set_child_fn (stmt, ctx->cb.dst_fn);
+    }
 
   scan_sharing_clauses (gimple_omp_parallel_clauses (stmt), ctx);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
@@ -3156,6 +3185,13 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   DECL_NAMELESS (name) = 1;
   TYPE_NAME (ctx->record_type) = name;
   TYPE_ARTIFICIAL (ctx->record_type) = 1;
+
+  for (size_t i = 0; i < gimple_omp_target_dimensions (stmt); i++)
+    {
+      scan_omp_op (gimple_omp_target_grid_size_ptr (stmt, i), ctx);
+      scan_omp_op (gimple_omp_target_workgroup_size_ptr (stmt, i), ctx);
+    }
+
   if (offloaded)
     {
       if (is_gimple_omp_oacc (stmt))
@@ -3205,6 +3241,11 @@ check_omp_nesting_restrictions (gimple *stmt, omp_context *ctx)
 {
   tree c;
 
+  if (ctx && gimple_code (ctx->stmt) == GIMPLE_OMP_GPUKERNEL)
+    /* GPUKERNEL is an artificial construct, nesting rules will be checked in
+       the original copy of its contents.  */
+    return true;
+
   /* No nesting of non-OpenACC STMT (that is, an OpenMP one, or a GOMP builtin)
      inside an OpenACC CTX.  */
   if (!(is_gimple_omp (stmt)
@@ -3831,6 +3872,7 @@ scan_omp_1_stmt (gimple_stmt_iterator *gsi, bool *handled_ops_p,
     case GIMPLE_OMP_TASKGROUP:
     case GIMPLE_OMP_ORDERED:
     case GIMPLE_OMP_CRITICAL:
+    case GIMPLE_OMP_GPUKERNEL:
       ctx = new_omp_context (stmt, ctx);
       scan_omp (gimple_omp_body_ptr (stmt), ctx);
       break;
@@ -6082,6 +6124,35 @@ gimple_build_cond_empty (tree cond)
   return gimple_build_cond (pred_code, lhs, rhs, NULL_TREE, NULL_TREE);
 }
 
+/* Return true if a parallel REGION is within a declare target function or
+   within a target region and is not a part of a gridified kernel.  */
+
+static bool
+region_needs_kernel_p (struct omp_region *region)
+{
+  bool indirect = false;
+  for (region = region->outer; region; region = region->outer)
+    {
+      if (region->type == GIMPLE_OMP_PARALLEL)
+	indirect = true;
+      else if (region->type == GIMPLE_OMP_TARGET)
+	{
+	  gomp_target *tgt_stmt;
+	  tgt_stmt = as_a <gomp_target *> (last_stmt (region->entry));
+	  if (gimple_omp_target_dimensions (tgt_stmt))
+	    return indirect;
+	  else
+	    return true;
+	}
+    }
+
+  if (lookup_attribute ("omp declare target",
+			DECL_ATTRIBUTES (current_function_decl)))
+    return true;
+
+  return false;
+}
+
 static void expand_omp_build_assign (gimple_stmt_iterator *, tree, tree,
 				     bool = false);
 
@@ -6236,7 +6307,8 @@ expand_parallel_call (struct omp_region *region, basic_block bb,
     t1 = null_pointer_node;
   else
     t1 = build_fold_addr_expr (t);
-  t2 = build_fold_addr_expr (gimple_omp_parallel_child_fn (entry_stmt));
+  tree child_fndecl = gimple_omp_parallel_child_fn (entry_stmt);
+  t2 = build_fold_addr_expr (child_fndecl);
 
   vec_alloc (args, 4 + vec_safe_length (ws_args));
   args->quick_push (t2);
@@ -6251,6 +6323,13 @@ expand_parallel_call (struct omp_region *region, basic_block bb,
 
   force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
 			    false, GSI_CONTINUE_LINKING);
+
+  if (hsa_gen_requested_p ()
+      && region_needs_kernel_p (region))
+    {
+      cgraph_node *child_cnode = cgraph_node::get (child_fndecl);
+      hsa_register_kernel (child_cnode);
+    }
 }
 
 /* Insert a function call whose name is FUNC_NAME with the information from
@@ -12092,6 +12171,98 @@ get_oacc_fn_attrib (tree fn)
   return lookup_attribute (OACC_FN_ATTRIB, DECL_ATTRIBUTES (fn));
 }
 
+/* Types used to pass grid and workgroup sizes to kernel invocation.  */
+
+static GTY(()) tree kernel_dim_array_type;
+static GTY(()) tree kernel_lattrs_dimnum_decl;
+static GTY(()) tree kernel_lattrs_grid_decl;
+static GTY(()) tree kernel_lattrs_group_decl;
+static GTY(()) tree kernel_launch_attributes_type;
+
+/* Create types used to pass kernel launch attributes to target.  */
+
+static void
+create_kernel_launch_attr_types (void)
+{
+  if (kernel_launch_attributes_type)
+    return;
+
+  tree dim_arr_index_type;
+  dim_arr_index_type = build_index_type (build_int_cst (integer_type_node, 2));
+  kernel_dim_array_type = build_array_type (uint32_type_node,
+					    dim_arr_index_type);
+
+  kernel_launch_attributes_type = make_node (RECORD_TYPE);
+  kernel_lattrs_dimnum_decl = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+				       get_identifier ("ndim"),
+				       uint32_type_node);
+  DECL_CHAIN (kernel_lattrs_dimnum_decl) = NULL_TREE;
+
+  kernel_lattrs_grid_decl = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+				     get_identifier ("grid_size"),
+				     kernel_dim_array_type);
+  DECL_CHAIN (kernel_lattrs_grid_decl) = kernel_lattrs_dimnum_decl;
+  kernel_lattrs_group_decl = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+				     get_identifier ("group_size"),
+				     kernel_dim_array_type);
+  DECL_CHAIN (kernel_lattrs_group_decl) = kernel_lattrs_grid_decl;
+  finish_builtin_struct (kernel_launch_attributes_type,
+			 "__gomp_kernel_launch_attributes",
+			 kernel_lattrs_group_decl, NULL_TREE);
+}
+
+/* Insert before the current statement in GSI a store of VALUE to INDEX of
+   array (of type kernel_dim_array_type) FLD_DECL of RANGE_VAR.  VALUE must be
+   of type uint32_type_node.  */
+
+static void
+insert_store_range_dim (gimple_stmt_iterator *gsi, tree range_var,
+			tree fld_decl, int index, tree value)
+{
+  tree ref = build4 (ARRAY_REF, uint32_type_node,
+		     build3 (COMPONENT_REF, kernel_dim_array_type,
+			     range_var, fld_decl, NULL_TREE),
+		     build_int_cst (integer_type_node, index),
+		     NULL_TREE, NULL_TREE);
+  gsi_insert_before (gsi, gimple_build_assign (ref, value), GSI_SAME_STMT);
+}
+
+/* Return a tree representation of a pointer to a structure with grid and
+   work-group size information.  Statements filling that information will be
+   inserted before GSI, TGT_STMT is the target statement which has the
+   necessary information in it.  */
+
+static tree
+get_kernel_launch_attributes (gimple_stmt_iterator *gsi, gomp_target *tgt_stmt)
+{
+  create_kernel_launch_attr_types ();
+  tree u32_one = build_one_cst (uint32_type_node);
+  tree lattrs = create_tmp_var (kernel_launch_attributes_type,
+				"__kernel_launch_attrs");
+  tree dimref = build3 (COMPONENT_REF, uint32_type_node,
+			lattrs, kernel_lattrs_dimnum_decl, NULL_TREE);
+  /* At this moment we cannot gridify a loop with a collapse clause.  */
+  /* TODO: Adjust when we support bigger collapse.  */
+  gcc_assert (gimple_omp_target_dimensions (tgt_stmt) == 1);
+  gsi_insert_before (gsi, gimple_build_assign (dimref, u32_one), GSI_SAME_STMT);
+
+  /* Calculation of grid size: */
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_grid_decl, 0,
+			  gimple_omp_target_grid_size (tgt_stmt, 0));
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_group_decl, 0,
+			  gimple_omp_target_workgroup_size (tgt_stmt, 0));
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_grid_decl, 1,
+			  u32_one);
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_group_decl, 2,
+			  u32_one);
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_grid_decl, 2,
+			  u32_one);
+  insert_store_range_dim (gsi, lattrs, kernel_lattrs_group_decl, 1,
+			  u32_one);
+  TREE_ADDRESSABLE (lattrs) = 1;
+  return build_fold_addr_expr (lattrs);
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -12485,6 +12657,10 @@ expand_omp_target (struct omp_region *region)
       else
 	depend = build_int_cst (ptr_type_node, 0);
       args.quick_push (depend);
+      if (gimple_omp_target_dimensions (entry_stmt))
+	args.quick_push (get_kernel_launch_attributes (&gsi, entry_stmt));
+      else
+	args.quick_push (build_zero_cst (ptr_type_node));
       break;
     case BUILT_IN_GOACC_PARALLEL:
       {
@@ -12588,6 +12764,255 @@ expand_omp_target (struct omp_region *region)
     }
 }
 
+/* Expand the KFOR loop as a GPGPU kernel, i.e. as a body only, with the
+   iteration variable derived from the thread number.  */
+
+static void
+expand_omp_for_kernel (struct omp_region *kfor)
+{
+  tree t, threadid;
+  tree type, itype;
+  gimple_stmt_iterator gsi;
+  tree n1, step;
+  struct omp_for_data fd;
+
+  gomp_for *for_stmt = as_a <gomp_for *> (last_stmt (kfor->entry));
+  gcc_checking_assert (gimple_omp_for_kind (for_stmt)
+		       == GF_OMP_FOR_KIND_KERNEL_BODY);
+  basic_block body_bb = FALLTHRU_EDGE (kfor->entry)->dest;
+
+  gcc_assert (gimple_omp_for_collapse (for_stmt) == 1);
+  gcc_assert (kfor->cont);
+  extract_omp_for_data (for_stmt, &fd, NULL);
+
+  itype = type = TREE_TYPE (fd.loop.v);
+  if (POINTER_TYPE_P (type))
+    itype = signed_type_for (type);
+
+  gsi = gsi_start_bb (body_bb);
+
+  n1 = fd.loop.n1;
+  step = fd.loop.step;
+  n1 = force_gimple_operand_gsi (&gsi, fold_convert (type, n1),
+				 true, NULL_TREE, true, GSI_SAME_STMT);
+  step = force_gimple_operand_gsi (&gsi, fold_convert (itype, step),
+				   true, NULL_TREE, true, GSI_SAME_STMT);
+  threadid = build_call_expr (builtin_decl_explicit
+			      (BUILT_IN_OMP_GET_THREAD_NUM), 0);
+  threadid = fold_convert (itype, threadid);
+  threadid = force_gimple_operand_gsi (&gsi, threadid, true, NULL_TREE,
+				       true, GSI_CONTINUE_LINKING);
+
+  tree startvar = fd.loop.v;
+  t = fold_build2 (MULT_EXPR, itype, threadid, step);
+  if (POINTER_TYPE_P (type))
+    t = fold_build_pointer_plus (n1, t);
+  else
+    t = fold_build2 (PLUS_EXPR, type, t, n1);
+  t = fold_convert (type, t);
+  t = force_gimple_operand_gsi (&gsi, t,
+				DECL_P (startvar)
+				&& TREE_ADDRESSABLE (startvar),
+				NULL_TREE, true, GSI_CONTINUE_LINKING);
+  gassign *assign_stmt = gimple_build_assign (startvar, t);
+  gsi_insert_after (&gsi, assign_stmt, GSI_CONTINUE_LINKING);
+
+  /* Remove the omp for statement.  */
+  gsi = gsi_last_bb (kfor->entry);
+  gsi_remove (&gsi, true);
+
+  /* Remove the GIMPLE_OMP_CONTINUE statement.  */
+  gsi = gsi_last_bb (kfor->cont);
+  gcc_assert (!gsi_end_p (gsi)
+	      && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_CONTINUE);
+  gsi_remove (&gsi, true);
+
+  /* Replace the GIMPLE_OMP_RETURN with a real return.  */
+  gsi = gsi_last_bb (kfor->exit);
+  gcc_assert (!gsi_end_p (gsi)
+	      && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  gsi_remove (&gsi, true);
+
+  /* Fixup the much simpler CFG.  */
+  remove_edge (find_edge (kfor->cont, body_bb));
+
+  if (kfor->cont != body_bb)
+    set_immediate_dominator (CDI_DOMINATORS, kfor->cont, body_bb);
+  set_immediate_dominator (CDI_DOMINATORS, kfor->exit, kfor->cont);
+}
+
+/* Structure passed to remap_kernel_arg_accesses so that it can remap
+   argument_decls.  */
+
+struct arg_decl_map
+{
+  tree old_arg;
+  tree new_arg;
+};
+
+/* Invoked through walk_gimple_op, will remap all PARM_DECLs to the ones
+   pertaining to kernel function.  */
+
+static tree
+remap_kernel_arg_accesses (tree *tp, int *walk_subtrees, void *data)
+{
+  struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
+  struct arg_decl_map *adm = (struct arg_decl_map *) wi->info;
+  tree t = *tp;
+
+  if (t == adm->old_arg)
+    *tp = adm->new_arg;
+  *walk_subtrees = !TYPE_P (t) && !DECL_P (t);
+  return NULL_TREE;
+}
+
+static void expand_omp (struct omp_region *region);
+
+/* If the TARGET region contains a kernel body for loop, remove its region
+   from the TARGET and expand it in GPGPU kernel fashion.  */
+
+static void
+expand_target_kernel_body (struct omp_region *target)
+{
+  if (!hsa_gen_requested_p ())
+    return;
+
+  gomp_target *tgt_stmt = as_a <gomp_target *> (last_stmt (target->entry));
+  struct omp_region **pp;
+
+  for (pp = &target->inner; *pp; pp = &(*pp)->next)
+    if ((*pp)->type == GIMPLE_OMP_GPUKERNEL)
+      break;
+
+  struct omp_region *gpukernel = *pp;
+
+  tree orig_child_fndecl = gimple_omp_target_child_fn (tgt_stmt);
+  if (!gpukernel)
+    {
+      /* HSA cannot handle OACC stuff.  */
+      if (gimple_omp_target_kind (tgt_stmt) != GF_OMP_TARGET_KIND_REGION)
+	return;
+      gcc_checking_assert (orig_child_fndecl);
+      gcc_assert (!gimple_omp_target_dimensions (tgt_stmt));
+      cgraph_node *n = cgraph_node::get (orig_child_fndecl);
+
+      hsa_register_kernel (n);
+      return;
+    }
+
+  gcc_assert (gimple_omp_target_dimensions (tgt_stmt));
+  tree inside_block = gimple_block (first_stmt (single_succ (gpukernel->entry)));
+  *pp = gpukernel->next;
+  for (pp = &gpukernel->inner; *pp; pp = &(*pp)->next)
+    if ((*pp)->type == GIMPLE_OMP_FOR)
+      break;
+
+  struct omp_region *kfor = *pp;
+  gcc_assert (kfor);
+  gcc_assert (gimple_omp_for_kind (last_stmt ((kfor)->entry))
+	      == GF_OMP_FOR_KIND_KERNEL_BODY);
+  *pp = kfor->next;
+  if (kfor->inner)
+    expand_omp (kfor->inner);
+  if (gpukernel->inner)
+    expand_omp (gpukernel->inner);
+
+  tree kern_fndecl = copy_node (orig_child_fndecl);
+  DECL_NAME (kern_fndecl) = clone_function_name (kern_fndecl, "kernel");
+  SET_DECL_ASSEMBLER_NAME (kern_fndecl, DECL_NAME (kern_fndecl));
+  tree tgtblock = gimple_block (tgt_stmt);
+  tree fniniblock = make_node (BLOCK);
+  BLOCK_ABSTRACT_ORIGIN (fniniblock) = tgtblock;
+  BLOCK_SOURCE_LOCATION (fniniblock) = BLOCK_SOURCE_LOCATION (tgtblock);
+  BLOCK_SOURCE_END_LOCATION (fniniblock) = BLOCK_SOURCE_END_LOCATION (tgtblock);
+  DECL_INITIAL (kern_fndecl) = fniniblock;
+  push_struct_function (kern_fndecl);
+  cfun->function_end_locus = gimple_location (tgt_stmt);
+  pop_cfun ();
+
+  tree old_parm_decl = DECL_ARGUMENTS (kern_fndecl);
+  gcc_assert (!DECL_CHAIN (old_parm_decl));
+  tree new_parm_decl = copy_node (DECL_ARGUMENTS (kern_fndecl));
+  DECL_CONTEXT (new_parm_decl) = kern_fndecl;
+  DECL_ARGUMENTS (kern_fndecl) = new_parm_decl;
+  struct function *kern_cfun = DECL_STRUCT_FUNCTION (kern_fndecl);
+  kern_cfun->curr_properties = cfun->curr_properties;
+
+  remove_edge (BRANCH_EDGE (kfor->entry));
+  expand_omp_for_kernel (kfor);
+
+  /* Remove the omp for statement.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (gpukernel->entry);
+  gsi_remove (&gsi, true);
+  /* Replace the GIMPLE_OMP_RETURN at the end of the kernel region with a real
+     return.  */
+  gsi = gsi_last_bb (gpukernel->exit);
+  gcc_assert (!gsi_end_p (gsi)
+	      && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  gimple *ret_stmt = gimple_build_return (NULL);
+  gsi_insert_after (&gsi, ret_stmt, GSI_SAME_STMT);
+  gsi_remove (&gsi, true);
+
+  /* Statements in the first BB in the target construct have been produced by
+     target lowering and must be copied inside the GPUKERNEL, with the two
+     exceptions of the first OMP statement and the OMP_DATA assignment
+     statement.  */
+  gsi = gsi_start_bb (single_succ (gpukernel->entry));
+  tree data_arg = gimple_omp_target_data_arg (tgt_stmt);
+  tree sender = data_arg ? TREE_VEC_ELT (data_arg, 0) : NULL;
+  for (gimple_stmt_iterator tsi = gsi_start_bb (single_succ (target->entry));
+       !gsi_end_p (tsi); gsi_next (&tsi))
+    {
+      gimple *stmt = gsi_stmt (tsi);
+      if (is_gimple_omp (stmt))
+	break;
+      if (sender
+	  && is_gimple_assign (stmt)
+	  && TREE_CODE (gimple_assign_rhs1 (stmt)) == ADDR_EXPR
+	  && TREE_OPERAND (gimple_assign_rhs1 (stmt), 0) == sender)
+	continue;
+      gimple *copy = gimple_copy (stmt);
+      gsi_insert_before (&gsi, copy, GSI_SAME_STMT);
+      gimple_set_block (copy, fniniblock);
+    }
+
+  move_sese_region_to_fn (kern_cfun, single_succ (gpukernel->entry),
+			  gpukernel->exit, inside_block);
+
+  cgraph_node *kcn = cgraph_node::get_create (kern_fndecl);
+  kcn->mark_force_output ();
+  cgraph_node *orig_child = cgraph_node::get (orig_child_fndecl);
+
+  hsa_register_kernel (kcn, orig_child);
+
+  cgraph_node::add_new_function (kern_fndecl, true);
+  push_cfun (kern_cfun);
+  cgraph_edge::rebuild_edges ();
+
+  /* Re-map any mention of the PARM_DECL of the original function to the
+     PARM_DECL of the new one.
+
+     TODO: It would be great if lowering produced references into the GPU
+     kernel decl straight away and we did not have to do this.  */
+  struct arg_decl_map adm;
+  adm.old_arg = old_parm_decl;
+  adm.new_arg = new_parm_decl;
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, kern_cfun)
+    {
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  struct walk_stmt_info wi;
+	  memset (&wi, 0, sizeof (wi));
+	  wi.info = &adm;
+	  walk_gimple_op (stmt, remap_kernel_arg_accesses, &wi);
+	}
+    }
+  pop_cfun ();
+
+  return;
+}
 
 /* Expand the parallel region tree rooted at REGION.  Expansion
    proceeds in depth-first order.  Innermost regions are expanded
@@ -12607,6 +13032,8 @@ expand_omp (struct omp_region *region)
        	 region.  */
       if (region->type == GIMPLE_OMP_PARALLEL)
 	determine_parallel_type (region);
+      else if (region->type == GIMPLE_OMP_TARGET)
+	expand_target_kernel_body (region);
 
       if (region->type == GIMPLE_OMP_FOR
 	  && gimple_omp_for_combined_p (last_stmt (region->entry)))
@@ -14402,11 +14829,13 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 						ctx);
 	}
 
-  gimple_seq_add_stmt (&body, stmt);
+  if (!gimple_omp_for_kernel_phony (stmt))
+    gimple_seq_add_stmt (&body, stmt);
   gimple_seq_add_seq (&body, gimple_omp_body (stmt));
 
-  gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
-							 fd.loop.v));
+  if (!gimple_omp_for_kernel_phony (stmt))
+    gimple_seq_add_stmt (&body, gimple_build_omp_continue (fd.loop.v,
+							   fd.loop.v));
 
   /* After the loop, add exit clauses.  */
   lower_reduction_clauses (gimple_omp_for_clauses (stmt), &body, ctx);
@@ -14418,9 +14847,12 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   body = maybe_catch_exception (body);
 
-  /* Region exit marker goes at the end of the loop body.  */
-  gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
-  maybe_add_implicit_barrier_cancel (ctx, &body);
+  if (!gimple_omp_for_kernel_phony (stmt))
+    {
+      /* Region exit marker goes at the end of the loop body.  */
+      gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
+      maybe_add_implicit_barrier_cancel (ctx, &body);
+    }
 
   /* Add OpenACC joining and reduction markers just after the loop.  */
   if (oacc_tail)
@@ -14863,6 +15294,14 @@ lower_omp_taskreg (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   par_olist = NULL;
   par_ilist = NULL;
   par_rlist = NULL;
+  bool phony_construct = is_a <gomp_parallel *> (stmt)
+    && gimple_omp_parallel_kernel_phony (as_a <gomp_parallel *> (stmt));
+  if (phony_construct && ctx->record_type)
+    {
+      gcc_checking_assert (!ctx->receiver_decl);
+      ctx->receiver_decl = create_tmp_var
+	(build_reference_type (ctx->record_type), ".omp_rec");
+    }
   lower_rec_input_clauses (clauses, &par_ilist, &par_olist, ctx, NULL);
   lower_omp (&par_body, ctx);
   if (gimple_code (stmt) == GIMPLE_OMP_PARALLEL)
@@ -14921,13 +15360,19 @@ lower_omp_taskreg (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     gimple_seq_add_stmt (&new_body,
 			 gimple_build_omp_continue (integer_zero_node,
 						    integer_zero_node));
-  gimple_seq_add_stmt (&new_body, gimple_build_omp_return (false));
-  gimple_omp_set_body (stmt, new_body);
+  if (!phony_construct)
+    {
+      gimple_seq_add_stmt (&new_body, gimple_build_omp_return (false));
+      gimple_omp_set_body (stmt, new_body);
+    }
 
   bind = gimple_build_bind (NULL, NULL, gimple_bind_block (par_bind));
   gsi_replace (gsi_p, dep_bind ? dep_bind : bind, true);
   gimple_bind_add_seq (bind, ilist);
-  gimple_bind_add_stmt (bind, stmt);
+  if (!phony_construct)
+    gimple_bind_add_stmt (bind, stmt);
+  else
+    gimple_bind_add_seq (bind, new_body);
   gimple_bind_add_seq (bind, olist);
 
   pop_gimplify_context (NULL);
@@ -16001,19 +16446,22 @@ lower_omp_teams (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 			   &bind_body, &dlist, ctx, NULL);
   lower_omp (gimple_omp_body_ptr (teams_stmt), ctx);
   lower_reduction_clauses (gimple_omp_teams_clauses (teams_stmt), &olist, ctx);
-  gimple_seq_add_stmt (&bind_body, teams_stmt);
-
-  location_t loc = gimple_location (teams_stmt);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOMP_TEAMS);
-  gimple *call = gimple_build_call (decl, 2, num_teams, thread_limit);
-  gimple_set_location (call, loc);
-  gimple_seq_add_stmt (&bind_body, call);
+  if (!gimple_omp_teams_kernel_phony (teams_stmt))
+    {
+      gimple_seq_add_stmt (&bind_body, teams_stmt);
+      location_t loc = gimple_location (teams_stmt);
+      tree decl = builtin_decl_explicit (BUILT_IN_GOMP_TEAMS);
+      gimple *call = gimple_build_call (decl, 2, num_teams, thread_limit);
+      gimple_set_location (call, loc);
+      gimple_seq_add_stmt (&bind_body, call);
+    }
 
   gimple_seq_add_seq (&bind_body, gimple_omp_body (teams_stmt));
   gimple_omp_set_body (teams_stmt, NULL);
   gimple_seq_add_seq (&bind_body, olist);
   gimple_seq_add_seq (&bind_body, dlist);
-  gimple_seq_add_stmt (&bind_body, gimple_build_omp_return (true));
+  if (!gimple_omp_teams_kernel_phony (teams_stmt))
+    gimple_seq_add_stmt (&bind_body, gimple_build_omp_return (true));
   gimple_bind_set_body (bind, bind_body);
 
   pop_gimplify_context (bind);
@@ -16024,6 +16472,17 @@ lower_omp_teams (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     TREE_USED (block) = 1;
 }
 
+/* Lower code within an artificial GPUKERNEL OMP construct.  */
+
+static void
+lower_omp_gpukernel (gimple_stmt_iterator *gsi_p, omp_context *ctx)
+{
+  gimple *stmt = gsi_stmt (*gsi_p);
+  lower_omp (gimple_omp_body_ptr (stmt), ctx);
+  gimple_seq_add_stmt (gimple_omp_body_ptr (stmt),
+		       gimple_build_omp_return (false));
+}
+
 
 /* Callback for lower_omp_1.  Return non-NULL if *tp needs to be
    regimplified.  If DATA is non-NULL, lower_omp_1 is outside
@@ -16235,6 +16694,11 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       gcc_assert (ctx);
       lower_omp_teams (gsi_p, ctx);
       break;
+    case GIMPLE_OMP_GPUKERNEL:
+      ctx = maybe_lookup_ctx (stmt);
+      gcc_assert (ctx);
+      lower_omp_gpukernel (gsi_p, ctx);
+      break;
     case GIMPLE_CALL:
       tree fndecl;
       call_stmt = as_a <gcall *> (stmt);
@@ -16324,7 +16788,647 @@ lower_omp (gimple_seq *body, omp_context *ctx)
       fold_stmt (&gsi);
   input_location = saved_location;
 }
-\f
+
+/* Return true if STMT is an assignment of a register-type value into a local
+   VAR_DECL.  */
+
+static bool
+reg_assignment_to_local_var_p (gimple *stmt)
+{
+  gassign *assign = dyn_cast <gassign *> (stmt);
+  if (!assign)
+    return false;
+  tree lhs = gimple_assign_lhs (assign);
+  if (TREE_CODE (lhs) != VAR_DECL
+      || !is_gimple_reg_type (TREE_TYPE (lhs))
+      || is_global_var (lhs))
+    return false;
+  return true;
+}
+
+/* Return true if all statements in SEQ are assignments to local register-type
+   variables.  */
+
+static bool
+seq_only_contains_local_assignments (gimple_seq seq)
+{
+  if (!seq)
+    return true;
+
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start (seq); !gsi_end_p (gsi); gsi_next (&gsi))
+    if (!reg_assignment_to_local_var_p (gsi_stmt (gsi)))
+      return false;
+  return true;
+}
+
+
+/* Scan statements in SEQ and call itself recursively on any bind.  If during
+   the whole search only assignments to register-type local variables and a
+   single OMP statement are encountered, return true, otherwise return false.
+   RET is where we store any OMP statement encountered.  TARGET_LOC and NAME
+   are used for dumping a note about a failure.  */
+
+static bool
+find_single_omp_among_assignments_1 (gimple_seq seq, location_t target_loc,
+				     const char *name, gimple **ret)
+{
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start (seq); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+
+      if (reg_assignment_to_local_var_p (stmt))
+	continue;
+      if (gbind *bind = dyn_cast <gbind *> (stmt))
+	{
+	  if (!find_single_omp_among_assignments_1 (gimple_bind_body (bind),
+						    target_loc, name, ret))
+	      return false;
+	}
+      else if (is_gimple_omp (stmt))
+	{
+	  if (*ret)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, target_loc,
+				 "Will not turn target construct into a simple "
+				 "GPGPU kernel because %s construct contains "
+				 "multiple OpenMP constructs\n", name);
+	      return false;
+	    }
+	  *ret = stmt;
+	}
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, target_loc,
+			     "Will not turn target construct into a simple "
+			     "GPGPU kernel because %s construct contains "
+			     "a complex statement\n", name);
+	  return false;
+	}
+    }
+  return true;
+}
+
+/* Scan statements in SEQ and make sure that it and any binds in it contain
+   only assignments to local register-type variables and one OMP construct.  If
+   so, return that construct, otherwise return NULL.  If dumping is enabled and
+   the function fails, use TARGET_LOC and NAME to dump a note with the reason for
+   failure.  */
+
+static gimple *
+find_single_omp_among_assignments (gimple_seq seq, location_t target_loc,
+				   const char *name)
+{
+  if (!seq)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, target_loc,
+			 "Will not turn target construct into a simple "
+			 "GPGPU kernel because %s construct has empty "
+			 "body\n",
+			 name);
+      return NULL;
+    }
+
+  gimple *ret = NULL;
+  if (find_single_omp_among_assignments_1 (seq, target_loc, name, &ret))
+    {
+      if (!ret && dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, target_loc,
+			 "Will not turn target construct into a simple "
+			 "GPGPU kernel because %s construct does not contain "
+			 "any other OpenMP construct\n", name);
+      return ret;
+    }
+  else
+    return NULL;
+}
+
+/* Walker function looking for statements for which there is no point in
+   gridifying (and for noreturn function calls which we cannot do).  Return
+   non-NULL if such a statement is found.  */
+
+static tree
+find_ungridifiable_statement (gimple_stmt_iterator *gsi, bool *handled_ops_p,
+			      struct walk_stmt_info *)
+{
+  *handled_ops_p = false;
+  gimple *stmt = gsi_stmt (*gsi);
+  switch (gimple_code (stmt))
+    {
+    case GIMPLE_CALL:
+      if (gimple_call_noreturn_p (as_a <gcall *> (stmt)))
+	{
+	  *handled_ops_p = true;
+	  return error_mark_node;
+	}
+      break;
+
+    /* We may reduce the following list if we find a way to implement the
+       clauses, but now there is no point trying further.  */
+    case GIMPLE_OMP_CRITICAL:
+    case GIMPLE_OMP_TASKGROUP:
+    case GIMPLE_OMP_TASK:
+    case GIMPLE_OMP_SECTION:
+    case GIMPLE_OMP_SECTIONS:
+    case GIMPLE_OMP_SECTIONS_SWITCH:
+    case GIMPLE_OMP_TARGET:
+    case GIMPLE_OMP_ORDERED:
+      *handled_ops_p = true;
+      return error_mark_node;
+
+    default:
+      break;
+    }
+  return NULL;
+}
+
+
+/* If TARGET follows a pattern that can be turned into a gridified GPGPU
+   kernel, return true, otherwise return false.  In the case of success, also
+   fill in GROUP_SIZE_P with the requested group size or NULL if there is
+   none.  */
+
+static bool
+target_follows_gridifiable_pattern (gomp_target *target, tree *group_size_p)
+{
+  if (gimple_omp_target_kind (target) != GF_OMP_TARGET_KIND_REGION)
+    return false;
+
+  location_t tloc = gimple_location (target);
+  gimple *stmt = find_single_omp_among_assignments (gimple_omp_body (target),
+						    tloc, "target");
+  if (!stmt)
+    return false;
+  gomp_teams *teams = dyn_cast <gomp_teams *> (stmt);
+  tree group_size = NULL;
+  if (!teams)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a simple "
+			 "GPGPU kernel because it does not have a sole teams "
+			 "construct in it.\n");
+      return false;
+    }
+
+  tree clauses = gimple_omp_teams_clauses (teams);
+  while (clauses)
+    {
+      switch (OMP_CLAUSE_CODE (clauses))
+	{
+	case OMP_CLAUSE_NUM_TEAMS:
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a "
+			     "gridified GPGPU kernel because we cannot "
+			     "handle num_teams clause of teams "
+			     "construct\n");
+	  return false;
+
+	case OMP_CLAUSE_REDUCTION:
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a "
+			     "gridified GPGPU kernel because a reduction "
+			     "clause is present\n");
+	  return false;
+
+	case OMP_CLAUSE_THREAD_LIMIT:
+	  group_size = OMP_CLAUSE_OPERAND (clauses, 0);
+	  break;
+
+	default:
+	  break;
+	}
+      clauses = OMP_CLAUSE_CHAIN (clauses);
+    }
+
+  stmt = find_single_omp_among_assignments (gimple_omp_body (teams), tloc,
+					    "teams");
+  if (!stmt)
+    return false;
+  gomp_for *dist = dyn_cast <gomp_for *> (stmt);
+  if (!dist)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a simple "
+			 "GPGPU kernel because the teams construct does not "
+			 "have a sole distribute construct in it.\n");
+      return false;
+    }
+
+  gcc_assert (gimple_omp_for_kind (dist) == GF_OMP_FOR_KIND_DISTRIBUTE);
+  if (!gimple_omp_for_combined_p (dist))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a gridified GPGPU "
+			 "kernel because we cannot handle a standalone "
+			 "distribute construct\n");
+      return false;
+    }
+  if (dist->collapse > 1)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a gridified GPGPU "
+			 "kernel because the distribute construct contains "
+			 "collapse clause\n");
+      return false;
+    }
+  struct omp_for_data fd;
+  extract_omp_for_data (dist, &fd, NULL);
+  if (fd.chunk_size)
+    {
+      if (group_size && !operand_equal_p (group_size, fd.chunk_size, 0))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a "
+			     "gridified GPGPU kernel because the teams "
+			     "thread limit is different from distribute "
+			     "schedule chunk\n");
+	  return false;
+	}
+      group_size = fd.chunk_size;
+    }
+  stmt = find_single_omp_among_assignments (gimple_omp_body (dist), tloc,
+					    "distribute");
+  gomp_parallel *par;
+  if (!stmt || !(par = dyn_cast <gomp_parallel *> (stmt)))
+    return false;
+
+  clauses = gimple_omp_parallel_clauses (par);
+  while (clauses)
+    {
+      switch (OMP_CLAUSE_CODE (clauses))
+	{
+	case OMP_CLAUSE_NUM_THREADS:
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a gridified "
+			     "GPGPU kernel because there is a num_threads "
+			     "clause of the parallel construct\n");
+	  return false;
+	case OMP_CLAUSE_REDUCTION:
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a "
+			     "gridified GPGPU kernel because a reduction "
+			     "clause is present\n");
+	  return false;
+	default:
+	  break;
+	}
+      clauses = OMP_CLAUSE_CHAIN (clauses);
+    }
+
+  stmt = find_single_omp_among_assignments (gimple_omp_body (par), tloc,
+					    "parallel");
+  gomp_for *gfor;
+  if (!stmt || !(gfor = dyn_cast <gomp_for *> (stmt)))
+    return false;
+
+  if (gimple_omp_for_kind (gfor) != GF_OMP_FOR_KIND_FOR)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a gridified GPGPU "
+			 "kernel because the inner loop is not a simple for "
+			 "loop\n");
+      return false;
+    }
+  if (gfor->collapse > 1)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a gridified GPGPU "
+			 "kernel because the inner loop contains collapse "
+			 "clause\n");
+      return false;
+    }
+
+  if (!seq_only_contains_local_assignments (gimple_omp_for_pre_body (gfor)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, tloc,
+			 "Will not turn target construct into a gridified GPGPU "
+			 "kernel because the inner loop pre_body contains "
+			 "a complex instruction\n");
+      return false;
+    }
+
+  clauses = gimple_omp_for_clauses (gfor);
+  while (clauses)
+    {
+      switch (OMP_CLAUSE_CODE (clauses))
+	{
+	case OMP_CLAUSE_SCHEDULE:
+	  if (OMP_CLAUSE_SCHEDULE_KIND (clauses) != OMP_CLAUSE_SCHEDULE_AUTO)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, tloc,
+				 "Will not turn target construct into a "
+				 "gridified GPGPU kernel because the inner "
+				 "loop has a non-automatic scheduling clause\n");
+	      return false;
+	    }
+	  break;
+
+	case OMP_CLAUSE_REDUCTION:
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a "
+			     "gridified GPGPU kernel because a reduction "
+			     "clause is present\n");
+	  return false;
+
+	default:
+	  break;
+	}
+      clauses = OMP_CLAUSE_CHAIN (clauses);
+    }
+
+  struct walk_stmt_info wi;
+  memset (&wi, 0, sizeof (wi));
+  if (gimple *bad = walk_gimple_seq (gimple_omp_body (gfor),
+				     find_ungridifiable_statement,
+				     NULL, &wi))
+    {
+      if (dump_enabled_p ())
+	{
+	  if (is_gimple_call (bad))
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a gridified "
+			     "GPGPU kernel because the inner loop contains "
+			     "a call to a noreturn function\n");
+	  else
+	    dump_printf_loc (MSG_NOTE, tloc,
+			     "Will not turn target construct into a gridified "
+			     "GPGPU kernel because the inner loop contains "
+			     "statement %s which cannot be transformed\n",
+			     gimple_code_name[(int) gimple_code (bad)]);
+	}
+      return false;
+    }
+
+  *group_size_p = group_size;
+  return true;
+}
+
+/* Operand walker, used to remap pre-body declarations according to a hash map
+   provided in DATA.  */
+
+static tree
+remap_prebody_decls (tree *tp, int *walk_subtrees, void *data)
+{
+  tree t = *tp;
+
+  if (DECL_P (t) || TYPE_P (t))
+    *walk_subtrees = 0;
+  else
+    *walk_subtrees = 1;
+
+  if (TREE_CODE (t) == VAR_DECL)
+    {
+      struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
+      hash_map<tree, tree> *declmap = (hash_map<tree, tree> *) wi->info;
+      tree *repl = declmap->get (t);
+      if (repl)
+	*tp = *repl;
+    }
+  return NULL_TREE;
+}
+
+/* Copy leading register-type assignments to local variables in SRC to just
+   before DST, creating temporaries, adjusting the mapping of operands in WI and
+   remapping operands as necessary.  Add any new temporaries to TGT_BIND.
+   Return the first statement that does not conform to
+   reg_assignment_to_local_var_p or NULL.  */
+
+static gimple *
+copy_leading_local_assignments (gimple_seq src, gimple_stmt_iterator *dst,
+				gbind *tgt_bind, struct walk_stmt_info *wi)
+{
+  hash_map<tree, tree> *declmap = (hash_map<tree, tree> *) wi->info;
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start (src); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (gbind *bind = dyn_cast <gbind *> (stmt))
+	{
+	  gimple *r = copy_leading_local_assignments (gimple_bind_body (bind),
+						      dst, tgt_bind, wi);
+	  if (r)
+	    return r;
+	  else
+	    continue;
+	}
+      if (!reg_assignment_to_local_var_p (stmt))
+	return stmt;
+      tree lhs = gimple_assign_lhs (as_a <gassign *> (stmt));
+      tree repl = copy_var_decl (lhs, create_tmp_var_name (NULL),
+				 TREE_TYPE (lhs));
+      DECL_CONTEXT (repl) = current_function_decl;
+      gimple_bind_append_vars (tgt_bind, repl);
+
+      declmap->put (lhs, repl);
+      gassign *copy = as_a <gassign *> (gimple_copy (stmt));
+      walk_gimple_op (copy, remap_prebody_decls, wi);
+      gsi_insert_before (dst, copy, GSI_SAME_STMT);
+    }
+  return NULL;
+}
+
+/* Given freshly copied top level kernel SEQ, identify the individual OMP
+   components, mark them as part of the kernel, return the inner loop, and copy
+   assignments leading to them to just before DST, remapping them using WI and
+   adding new temporaries to TGT_BIND.  */
+
+static gomp_for *
+process_kernel_body_copy (gimple_seq seq, gimple_stmt_iterator *dst,
+			  gbind *tgt_bind, struct walk_stmt_info *wi)
+{
+  gimple *stmt = copy_leading_local_assignments (seq, dst, tgt_bind, wi);
+  gomp_teams *teams = dyn_cast <gomp_teams *> (stmt);
+  gcc_assert (teams);
+  gimple_omp_teams_set_kernel_phony (teams, true);
+  stmt = copy_leading_local_assignments (gimple_omp_body (teams), dst,
+					 tgt_bind, wi);
+  gcc_checking_assert (stmt);
+  gomp_for *dist = dyn_cast <gomp_for *> (stmt);
+  gcc_assert (dist);
+  gimple_seq prebody = gimple_omp_for_pre_body (dist);
+  if (prebody)
+    copy_leading_local_assignments (prebody, dst, tgt_bind, wi);
+  gimple_omp_for_set_kernel_phony (dist, true);
+  stmt = copy_leading_local_assignments (gimple_omp_body (dist), dst,
+					 tgt_bind, wi);
+  gcc_checking_assert (stmt);
+
+  gomp_parallel *parallel = as_a <gomp_parallel *> (stmt);
+  gimple_omp_parallel_set_kernel_phony (parallel, true);
+  stmt = copy_leading_local_assignments (gimple_omp_body (parallel), dst,
+					 tgt_bind, wi);
+  gomp_for *inner_loop = as_a <gomp_for *> (stmt);
+  gimple_omp_for_set_kind (inner_loop, GF_OMP_FOR_KIND_KERNEL_BODY);
+  prebody = gimple_omp_for_pre_body (inner_loop);
+  if (prebody)
+    copy_leading_local_assignments (prebody, dst, tgt_bind, wi);
+
+  return inner_loop;
+}
+
+/* If TARGET points to a GOMP_TARGET which follows a gridifiable pattern,
+   create a GPU kernel for it.  GSI must point to the same statement, TGT_BIND
+   is the bind into which temporaries inserted before TARGET should be
+   added.  */
+
+static void
+attempt_target_gridification (gomp_target *target, gimple_stmt_iterator *gsi,
+			      gbind *tgt_bind)
+{
+  tree group_size;
+  if (!target || !target_follows_gridifiable_pattern (target, &group_size))
+    return;
+
+  location_t loc = gimple_location (target);
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, loc,
+		     "Target construct will be turned into a gridified GPGPU "
+		     "kernel\n");
+
+  /* Copy target body to a GPUKERNEL construct.  */
+  gimple_seq kernel_seq = copy_gimple_seq_and_replace_locals
+    (gimple_omp_body (target));
+
+  hash_map<tree, tree> *declmap = new hash_map<tree, tree>;
+  struct walk_stmt_info wi;
+  memset (&wi, 0, sizeof (struct walk_stmt_info));
+  wi.info = declmap;
+
+  /* Copy assignments in between OMP statements to before the target, mark OMP
+     statements within the copy appropriately.  */
+  gomp_for *inner_loop = process_kernel_body_copy (kernel_seq, gsi, tgt_bind,
+						   &wi);
+
+  gbind *old_bind = as_a <gbind *> (gimple_seq_first (gimple_omp_body (target)));
+  gbind *new_bind = as_a <gbind *> (gimple_seq_first (kernel_seq));
+  tree new_block = gimple_bind_block (new_bind);
+  tree enc_block = BLOCK_SUPERCONTEXT (gimple_bind_block (old_bind));
+  BLOCK_CHAIN (new_block) = BLOCK_SUBBLOCKS (enc_block);
+  BLOCK_SUBBLOCKS (enc_block) = new_block;
+  BLOCK_SUPERCONTEXT (new_block) = enc_block;
+  gimple *gpukernel = gimple_build_omp_gpukernel (kernel_seq);
+  gimple_seq_add_stmt
+    (gimple_bind_body_ptr (as_a <gbind *> (gimple_omp_body (target))),
+     gpukernel);
+
+  walk_tree (&group_size, remap_prebody_decls, &wi, NULL);
+  size_t collapse = gimple_omp_for_collapse (inner_loop);
+  gimple_omp_target_init_dimensions (target, collapse);
+  for (size_t i = 0; i < collapse; i++)
+    {
+      gimple_omp_for_iter iter = inner_loop->iter[i];
+      walk_tree (&iter.initial, remap_prebody_decls, &wi, NULL);
+      walk_tree (&iter.final, remap_prebody_decls, &wi, NULL);
+
+      tree itype, type = TREE_TYPE (iter.index);
+      if (POINTER_TYPE_P (type))
+	itype = signed_type_for (type);
+      else
+	itype = type;
+
+      enum tree_code cond_code = iter.cond;
+      tree n1 = iter.initial;
+      tree n2 = iter.final;
+      adjust_for_condition (loc, &cond_code, &n2);
+      tree step = get_omp_for_step_from_incr (loc, iter.incr);
+      n1 = force_gimple_operand_gsi (gsi, fold_convert (type, n1), true,
+				     NULL_TREE, true, GSI_SAME_STMT);
+      n2 = force_gimple_operand_gsi (gsi, fold_convert (itype, n2), true,
+				     NULL_TREE,
+				     true, GSI_SAME_STMT);
+      tree t = build_int_cst (itype, (cond_code == LT_EXPR ? -1 : 1));
+      t = fold_build2 (PLUS_EXPR, itype, step, t);
+      t = fold_build2 (PLUS_EXPR, itype, t, n2);
+      t = fold_build2 (MINUS_EXPR, itype, t, fold_convert (itype, n1));
+      if (TYPE_UNSIGNED (itype) && cond_code == GT_EXPR)
+	t = fold_build2 (TRUNC_DIV_EXPR, itype,
+			 fold_build1 (NEGATE_EXPR, itype, t),
+			 fold_build1 (NEGATE_EXPR, itype, step));
+      else
+	t = fold_build2 (TRUNC_DIV_EXPR, itype, t, step);
+      t = fold_convert (uint32_type_node, t);
+      tree gs = force_gimple_operand_gsi (gsi, t, true, NULL_TREE, true,
+					  GSI_SAME_STMT);
+      gimple_omp_target_set_grid_size (target, i, gs);
+      tree ws;
+      if (i == 0 && group_size)
+	{
+	  ws = fold_convert (uint32_type_node, group_size);
+	  ws = force_gimple_operand_gsi (gsi, ws, true, NULL_TREE, true,
+					 GSI_SAME_STMT);
+	}
+      else
+	ws = build_zero_cst (uint32_type_node);
+      gimple_omp_target_set_workgroup_size (target, i, ws);
+    }
+
+  delete declmap;
+  return;
+}
+
+/* Walker function doing all the work for create_target_gpukernels.  */
+
+static tree
+create_target_gpukernel_stmt (gimple_stmt_iterator *gsi, bool *handled_ops_p,
+			      struct walk_stmt_info *incoming)
+{
+  *handled_ops_p = false;
+
+  gimple *stmt = gsi_stmt (*gsi);
+  gomp_target *target = dyn_cast <gomp_target *> (stmt);
+  if (target)
+    {
+      gbind *tgt_bind = (gbind *) incoming->info;
+      gcc_checking_assert (tgt_bind);
+      attempt_target_gridification (target, gsi, tgt_bind);
+      return NULL_TREE;
+    }
+  gbind *bind = dyn_cast <gbind *> (stmt);
+  if (bind)
+    {
+      *handled_ops_p = true;
+      struct walk_stmt_info wi;
+      memset (&wi, 0, sizeof (wi));
+      wi.info = bind;
+      walk_gimple_seq_mod (gimple_bind_body_ptr (bind),
+			   create_target_gpukernel_stmt, NULL, &wi);
+    }
+  return NULL_TREE;
+}
+
+/* Prepare all target constructs in BODY_P for GPU kernel generation, if they
+   follow a gridifiable pattern.  All such targets will have their bodies
+   duplicated, with the new copy being put into a gpukernel.  All
+   kernel-related constructs within the gpukernel will be marked with phony
+   flags or kernel kinds.  Moreover, some re-structuring is often needed, such
+   as copying pre-bodies before the target construct so that kernel grid sizes
+   can be computed.  */
+
+static void
+create_target_gpukernels (gimple_seq *body_p)
+{
+  struct walk_stmt_info wi;
+  memset (&wi, 0, sizeof (wi));
+  walk_gimple_seq_mod (body_p, create_target_gpukernel_stmt, NULL, &wi);
+}
+
+
 /* Main entry point.  */
 
 static unsigned int
@@ -16344,6 +17448,10 @@ execute_lower_omp (void)
 				 delete_omp_context);
 
   body = gimple_body (current_function_decl);
+
+  if (hsa_gen_requested_p () && !flag_disable_hsa_gridification)
+    create_target_gpukernels (&body);
+
   scan_omp (&body, NULL);
   gcc_assert (taskreg_nesting_level == 0);
   FOR_EACH_VEC_ELT (taskreg_contexts, i, ctx)
@@ -16681,6 +17789,7 @@ make_gimple_omp_edges (basic_block bb, struct omp_region **region,
     case GIMPLE_OMP_TASKGROUP:
     case GIMPLE_OMP_CRITICAL:
     case GIMPLE_OMP_SECTION:
+    case GIMPLE_OMP_GPUKERNEL:
       cur_region = new_omp_region (bb, code, cur_region);
       fallthru = true;
       break;

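For reference, a minimal example (made up for illustration, not part of
the patch) of a loop that target_follows_gridifiable_pattern accepts and
that is therefore turned into a gridified GPGPU kernel:

  void
  vec_add (float *a, float *b, float *c, int n)
  {
  #pragma omp target teams distribute parallel for \
      thread_limit(64) map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }

Any clause the matcher punts on (num_teams, reduction, collapse, a
non-auto schedule or num_threads on the inner constructs) makes it bail
out, emitting the corresponding note above when dumping is enabled.
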
^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 5/12] New HSA-related GCC options
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (3 preceding siblings ...)
  2015-11-05 21:57 ` [hsa 4/12] OpenMP lowering/expansion changes (gridification) Martin Jambor
@ 2015-11-05 21:58 ` Martin Jambor
  2015-11-05 22:48   ` Joseph Myers
  2015-11-06  8:42   ` Richard Biener
  2015-11-05 21:59 ` [hsa 6/12] IPA-HSA pass Martin Jambor
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:58 UTC (permalink / raw)
  To: GCC Patches; +Cc: Richard Biener

Hi,

the following small part of the merge deals with new options.  It adds
four independent things:

1) flag_disable_hsa is used by code in opts.c (in the first patch) to
   remember whether HSA has been explicitly disabled on the compiler
   command line.

2) -Whsa is a new warning we emit whenever we fail to produce HSAIL
   for some source code.  It is on by default but of course only
   emitted by HSAIL-generating code, so it should never affect anybody who
   does not use an HSA-enabled compiler and OpenMP 4 device constructs.

We have found the following two additions very useful for debugging on
the branch but will understand if they are not deemed suitable for
trunk and will gladly remove them:

3) -fdisable-hsa-gridification disables the gridification process to
   ease experimenting with dynamic parallelism.  With this option,
   HSAIL is always generated from the CPU-intended gimple.

4) Parameter hsa-gen-debug-stores will be obsolete once HSA run-time
   supports debugging traps.  Until then, we have to make do with
   debugging stores to memory at defined places, which however can
   cost speed in benchmarks.  So they are only enabled with this
   parameter.  We decided to make it a parameter rather than a switch
   to emphasize the fact that it will go away and to possibly allow us
   to select different levels of verbosity of the stores in the future.
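
For illustration, assuming a compiler configured with
--enable-offload-targets=hsa, the new bits can be exercised as follows
(hypothetical invocations, test.c standing for any OpenMP 4 test case):

  gcc -fopenmp test.c                                 # -Whsa is on by default
  gcc -fopenmp -Wno-hsa test.c                        # silence HSAIL warnings
  gcc -fopenmp -fdisable-hsa-gridification test.c     # HSAIL from CPU gimple
  gcc -fopenmp --param hsa-gen-debug-stores=1 test.c  # enable the debug stores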

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>

	* common.opt (flag_disable_hsa): New variable.
	(-Whsa): New warning.
	(-fdisable-hsa-gridification): New option.
	* params.def (PARAM_HSA_GEN_DEBUG_STORES): New parameter.

diff --git a/gcc/common.opt b/gcc/common.opt
index 961a1b6..9cb52db 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -223,6 +223,10 @@ unsigned int flag_sanitize_recover = SANITIZE_UNDEFINED | SANITIZE_NONDEFAULT |
 Variable
 bool dump_base_name_prefixed = false
 
+; Flag whether HSA generation has been explicitly disabled
+Variable
+bool flag_disable_hsa = false
+
 ###
 Driver
 
@@ -577,6 +581,10 @@ Wfree-nonheap-object
 Common Var(warn_free_nonheap_object) Init(1) Warning
 Warn when attempting to free a non-heap object.
 
+Whsa
+Common Var(warn_hsa) Init(1) Warning
+Warn when a function cannot be expanded to HSAIL.
+
 Winline
 Common Var(warn_inline) Warning
 Warn when an inlined function cannot be inlined.
@@ -1107,6 +1115,10 @@ fdiagnostics-show-location=
 Common Joined RejectNegative Enum(diagnostic_prefixing_rule)
 -fdiagnostics-show-location=[once|every-line]	How often to emit source location at the beginning of line-wrapped diagnostics.
 
+fdisable-hsa-gridification
+Common Report Var(flag_disable_hsa_gridification)
Disable HSA gridification for OMP pragmas.
+
 ; Required for these enum values.
 SourceInclude
 pretty-print.h
diff --git a/gcc/params.def b/gcc/params.def
index c5d96e7..86911e2 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1177,6 +1177,11 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
 	  "Maximum recursion depth allowed when querying a property of an"
 	  " SSA name.",
 	  2, 1, 0)
+
+DEFPARAM (PARAM_HSA_GEN_DEBUG_STORES,
+	  "hsa-gen-debug-stores",
+	  "Level of HSA debug stores verbosity.",
+	  0, 0, 1)
 /*
 
 Local variables:

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 6/12] IPA-HSA pass
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (4 preceding siblings ...)
  2015-11-05 21:58 ` [hsa 5/12] New HSA-related GCC options Martin Jambor
@ 2015-11-05 21:59 ` Martin Jambor
  2015-11-05 22:01 ` [hsa 7/12] Disabling the vectorizer for GPU kernels/functions Martin Jambor
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 21:59 UTC (permalink / raw)
  To: GCC Patches; +Cc: Martin Liska, Jan Hubicka

Hi,

when a target construct is gridified, the HSA GPU function is
associated with the CPU function throughout the compilation, so that
they can be registered as a pair in libgomp.

When a target or a parallel construct is not gridified, its body
emerges out of OMP expansion as one gimple function.  However, at some
point we need to create a special HSA function representation so that
we can modify behavior of a (very) few optimization passes for them.
Similarly, "omp declare target" functions, which ought to be callable
from HSA, should get their own representation for exactly the same
reason.

Both are done by the following new IPA pass, which creates new HSA
clones in these cases.  Moreover, it redirects the appropriate call
graph edges to run between the HSA implementations, marks HSA clones
with the flatten attribute to minimize call overhead (which is much
more significant on GPUs) and makes sure the CPU and GPU functions are
coupled together and remain in the same LTO partition so that they can
be registered together with libgomp.
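
As a purely illustrative example (foo and bar are made-up names), in the
following snippet the pass would create an HSA clone of foo because it
is an "omp declare target" function, an HSA clone of the outlined body
of the target region in bar, and would redirect the call to foo in the
GPU variant to foo's HSA clone:

  #pragma omp declare target
  int
  foo (int i)
  {
    return i * i;
  }
  #pragma omp end declare target

  void
  bar (int *a, int n)
  {
  #pragma omp target map(tofrom: a[0:n])
    for (int i = 0; i < n; i++)
      a[i] = foo (a[i]);
  }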

Thanks,

Martin


2015-11-05  Martin Liska  <mliska@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	* ipa-hsa.c: New file.


diff --git a/gcc/ipa-hsa.c b/gcc/ipa-hsa.c
new file mode 100644
index 0000000..b4cb58e
--- /dev/null
+++ b/gcc/ipa-hsa.c
@@ -0,0 +1,334 @@
+/* Interprocedural HSA pass: creation of HSA clones.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Martin Liska <mliska@suse.cz>
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* The interprocedural HSA pass is responsible for the creation of HSA clones.
+   For all these HSA clones, we emit HSAIL instructions and then terminate
+   pass processing.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "hash-set.h"
+#include "vec.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "function.h"
+#include "basic-block.h"
+#include "gimple.h"
+#include "dumpfile.h"
+#include "gimple-pretty-print.h"
+#include "tree-streamer.h"
+#include "stringpool.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+namespace {
+
+/* If NODE is not versionable, warn about not emitting HSAIL and return false.
+   Otherwise return true.  */
+
+static bool
+check_warn_node_versionable (cgraph_node *node)
+{
+  if (!node->local.versionable)
+    {
+      if (warning_at (EXPR_LOCATION (node->decl), OPT_Whsa,
+		      HSA_SORRY_MSG))
+	inform (EXPR_LOCATION (node->decl),
+		"Function cannot be cloned");
+      return false;
+    }
+  return true;
+}
+
+/* The function creates HSA clones for all functions that were either
+   marked as HSA kernels or are callable HSA functions.  Apart from that,
+   we redirect all edges that come from an HSA clone and end in another
+   HSA clone to connect these two functions.  */
+
+static unsigned int
+process_hsa_functions (void)
+{
+  struct cgraph_node *node;
+
+  if (hsa_summaries == NULL)
+    hsa_summaries = new hsa_summary_t (symtab);
+
+  FOR_EACH_DEFINED_FUNCTION (node)
+    {
+      hsa_function_summary *s = hsa_summaries->get (node);
+
+      /* A linked function is skipped.  */
+      if (s->m_binded_function != NULL)
+	continue;
+
+      if (s->m_kind != HSA_NONE)
+	{
+	  if (!check_warn_node_versionable (node))
+	    continue;
+	  cgraph_node *clone = node->create_virtual_clone
+	    (vec <cgraph_edge *> (), NULL, NULL, "hsa");
+	  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
+	  if (s->m_kind == HSA_KERNEL)
+	    DECL_ATTRIBUTES (clone->decl)
+	      = tree_cons (get_identifier ("flatten"), NULL_TREE,
+			   DECL_ATTRIBUTES (clone->decl));
+
+	  clone->force_output = true;
+	  hsa_summaries->link_functions (clone, node, s->m_kind);
+
+	  if (dump_file)
+	    fprintf (dump_file, "HSA creates a new clone: %s, type: %s\n",
+		     clone->name (),
+		     s->m_kind == HSA_KERNEL ? "kernel" : "function");
+	}
+      else if (hsa_callable_function_p (node->decl))
+	{
+	  if (!check_warn_node_versionable (node))
+	    continue;
+	  cgraph_node *clone = node->create_virtual_clone
+	    (vec <cgraph_edge *> (), NULL, NULL, "hsa");
+	  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
+
+	  if (!cgraph_local_p (node))
+	    clone->force_output = true;
+	  hsa_summaries->link_functions (clone, node, HSA_FUNCTION);
+
+	  if (dump_file)
+	    fprintf (dump_file, "HSA creates a new function clone: %s\n",
+		     clone->name ());
+	}
+    }
+
+  /* Redirect all edges that are between HSA clones.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    {
+      cgraph_edge *e = node->callees;
+
+      while (e)
+	{
+	  hsa_function_summary *src = hsa_summaries->get (node);
+	  if (src->m_kind != HSA_NONE && src->m_gpu_implementation_p)
+	    {
+	      hsa_function_summary *dst = hsa_summaries->get (e->callee);
+	      if (dst->m_kind != HSA_NONE && !dst->m_gpu_implementation_p)
+		{
+		  e->redirect_callee (dst->m_binded_function);
+		  if (dump_file)
+		    fprintf (dump_file,
+			     "Redirecting edge to HSA function: %s->%s\n",
+			     xstrdup_for_dump (e->caller->name ()),
+			     xstrdup_for_dump (e->callee->name ()));
+		}
+	    }
+
+	  e = e->next_callee;
+	}
+    }
+
+  return 0;
+}
+
+/* Iterate over all HSA functions and stream out their summaries.  */
+
+static void
+ipa_hsa_write_summary (void)
+{
+  struct bitpack_d bp;
+  struct cgraph_node *node;
+  struct output_block *ob;
+  unsigned int count = 0;
+  lto_symtab_encoder_iterator lsei;
+  lto_symtab_encoder_t encoder;
+
+  if (!hsa_summaries)
+    return;
+
+  ob = create_output_block (LTO_section_ipa_hsa);
+  encoder = ob->decl_state->symtab_node_encoder;
+  ob->symbol = NULL;
+  for (lsei = lsei_start_function_in_partition (encoder); !lsei_end_p (lsei);
+       lsei_next_function_in_partition (&lsei))
+    {
+      node = lsei_cgraph_node (lsei);
+      hsa_function_summary *s = hsa_summaries->get (node);
+
+      if (s->m_kind != HSA_NONE)
+	count++;
+    }
+
+  streamer_write_uhwi (ob, count);
+
+  /* Process all of the functions.  */
+  for (lsei = lsei_start_function_in_partition (encoder); !lsei_end_p (lsei);
+       lsei_next_function_in_partition (&lsei))
+    {
+      node = lsei_cgraph_node (lsei);
+      hsa_function_summary *s = hsa_summaries->get (node);
+
+      if (s->m_kind != HSA_NONE)
+	{
+	  encoder = ob->decl_state->symtab_node_encoder;
+	  int node_ref = lto_symtab_encoder_encode (encoder, node);
+	  streamer_write_uhwi (ob, node_ref);
+
+	  bp = bitpack_create (ob->main_stream);
+	  bp_pack_value (&bp, s->m_kind, 2);
+	  bp_pack_value (&bp, s->m_gpu_implementation_p, 1);
+	  bp_pack_value (&bp, s->m_binded_function != NULL, 1);
+	  streamer_write_bitpack (&bp);
+	  if (s->m_binded_function)
+	    stream_write_tree (ob, s->m_binded_function->decl, true);
+	}
+    }
+
+  streamer_write_char_stream (ob->main_stream, 0);
+  produce_asm (ob, NULL);
+  destroy_output_block (ob);
+}
+
+/* Read section in file FILE_DATA of length LEN with data DATA.  */
+
+static void
+ipa_hsa_read_section (struct lto_file_decl_data *file_data, const char *data,
+		       size_t len)
+{
+  const struct lto_function_header *header =
+    (const struct lto_function_header *) data;
+  const int cfg_offset = sizeof (struct lto_function_header);
+  const int main_offset = cfg_offset + header->cfg_size;
+  const int string_offset = main_offset + header->main_size;
+  struct data_in *data_in;
+  unsigned int i;
+  unsigned int count;
+
+  lto_input_block ib_main ((const char *) data + main_offset,
+			   header->main_size, file_data->mode_table);
+
+  data_in =
+    lto_data_in_create (file_data, (const char *) data + string_offset,
+			header->string_size, vNULL);
+  count = streamer_read_uhwi (&ib_main);
+
+  for (i = 0; i < count; i++)
+    {
+      unsigned int index;
+      struct cgraph_node *node;
+      lto_symtab_encoder_t encoder;
+
+      index = streamer_read_uhwi (&ib_main);
+      encoder = file_data->symtab_node_encoder;
+      node = dyn_cast<cgraph_node *> (lto_symtab_encoder_deref (encoder,
+								index));
+      gcc_assert (node->definition);
+      hsa_function_summary *s = hsa_summaries->get (node);
+
+      struct bitpack_d bp = streamer_read_bitpack (&ib_main);
+      s->m_kind = (hsa_function_kind) bp_unpack_value (&bp, 2);
+      s->m_gpu_implementation_p = bp_unpack_value (&bp, 1);
+      bool has_tree = bp_unpack_value (&bp, 1);
+
+      if (has_tree)
+	{
+	  tree decl = stream_read_tree (&ib_main, data_in);
+	  s->m_binded_function = cgraph_node::get_create (decl);
+	}
+    }
+  lto_free_section_data (file_data, LTO_section_ipa_hsa, NULL, data,
+			 len);
+  lto_data_in_delete (data_in);
+}
+
+/* Load the streamed HSA function summaries and assign them to functions.  */
+
+static void
+ipa_hsa_read_summary (void)
+{
+  struct lto_file_decl_data **file_data_vec = lto_get_file_decl_data ();
+  struct lto_file_decl_data *file_data;
+  unsigned int j = 0;
+
+  if (hsa_summaries == NULL)
+    hsa_summaries = new hsa_summary_t (symtab);
+
+  while ((file_data = file_data_vec[j++]))
+    {
+      size_t len;
+      const char *data = lto_get_section_data (file_data, LTO_section_ipa_hsa,
+					       NULL, &len);
+
+      if (data)
+	ipa_hsa_read_section (file_data, data, len);
+    }
+}
+
+const pass_data pass_data_ipa_hsa =
+{
+  IPA_PASS, /* type */
+  "hsa", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_HSA, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_dump_symtab, /* todo_flags_finish */
+};
+
+class pass_ipa_hsa : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_hsa (gcc::context *ctxt)
+    : ipa_opt_pass_d (pass_data_ipa_hsa, ctxt,
+		      NULL, /* generate_summary */
+		      ipa_hsa_write_summary, /* write_summary */
+		      ipa_hsa_read_summary, /* read_summary */
+		      ipa_hsa_write_summary, /* write_optimization_summary */
+		      ipa_hsa_read_summary, /* read_optimization_summary */
+		      NULL, /* stmt_fixup */
+		      0, /* function_transform_todo_flags_start */
+		      NULL, /* function_transform */
+		      NULL) /* variable_transform */
+    {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+
+  virtual unsigned int execute (function *) { return process_hsa_functions (); }
+
+}; // class pass_ipa_hsa
+
+bool
+pass_ipa_hsa::gate (function *)
+{
+  return hsa_gen_requested_p () || in_lto_p;
+}
+
+} // anon namespace
+
+ipa_opt_pass_d *
+make_pass_ipa_hsa (gcc::context *ctxt)
+{
+  return new pass_ipa_hsa (ctxt);
+}
diff --git a/gcc/lto-section-in.c b/gcc/lto-section-in.c
index e7ace09..840e26b 100644
--- a/gcc/lto-section-in.c
+++ b/gcc/lto-section-in.c
@@ -51,7 +51,8 @@ const char *lto_section_name[LTO_N_SECTION_TYPES] =
   "ipcp_trans",
   "icf",
   "offload_table",
-  "mode_table"
+  "mode_table",
+  "hsa"
 };
 
 
diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
index 5aae9e9..b29ff18 100644
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -244,6 +244,7 @@ enum lto_section_type
   LTO_section_ipa_icf,
   LTO_section_offload_table,
   LTO_section_mode_table,
+  LTO_section_ipa_hsa,
   LTO_N_SECTION_TYPES		/* Must be last.  */
 };
 
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 03ed72b..a966014 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-inline.h"
 #include "ipa-utils.h"
 #include "lto-partition.h"
+#include "hsa.h"
 
 vec<ltrans_partition> ltrans_partitions;
 
@@ -177,6 +178,24 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
 	 Therefore put it into the same partition.  */
       if (cnode->instrumented_version)
 	add_symbol_to_partition_1 (part, cnode->instrumented_version);
+
+      /* Add an HSA function associated with the symbol.  */
+      if (hsa_summaries != NULL)
+	{
+	  hsa_function_summary *s = hsa_summaries->get (cnode);
+	  if (s->m_kind == HSA_KERNEL)
+	    {
+	      /* Add the bound function.  */
+	      bool added = add_symbol_to_partition_1 (part,
+						      s->m_binded_function);
+	      gcc_assert (added);
+	      if (symtab->dump_file)
+		fprintf (symtab->dump_file,
+			 "adding an HSA function (host/gpu) to the "
+			 "partition: %s\n",
+			 s->m_binded_function->name ());
+	    }
+	}
     }
 
   add_references_to_partition (part, node);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 7/12] Disabling the vectorizer for GPU kernels/functions
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (5 preceding siblings ...)
  2015-11-05 21:59 ` [hsa 6/12] IPA-HSA pass Martin Jambor
@ 2015-11-05 22:01 ` Martin Jambor
  2015-11-06  8:38   ` Richard Biener
  2015-11-05 22:02 ` [hsa 8/12] Pass manager changes Martin Jambor
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:01 UTC (permalink / raw)
  To: GCC Patches; +Cc: Richard Biener

Hi,

in the previous email I wrote we need to "change behavior" of a few
optimization passes.  One was the flattening of GPU functions and the
other two are in the patch below.  It boils down to the fact that, at the
moment, we need to switch off the vectorizer (only for the GPU
functions, of course).

We are actually quite close to being able to handle gimple vector
input in the HSA back-end but not all the way yet, and before allowing
the vectorizer again, we will have to make sure it never produces
vectors bigger than 128 bits (in GPU functions).

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>

	* tree-ssa-loop.c: Include cgraph.h, symbol-summary.h and hsa.h.
	(pass_vectorize::gate): Do not run on HSA functions.
	* tree-vectorizer.c: Include symbol-summary.h and hsa.h.
	(pass_slp_vectorize::gate): Do not run on HSA functions.

diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 8ecd140..0d119e2 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "cgraph.h"
+#include "symbol-summary.h"
+#include "hsa.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -257,7 +260,8 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *fun)
     {
-      return flag_tree_loop_vectorize || fun->has_force_vectorize_loops;
+      return (flag_tree_loop_vectorize || fun->has_force_vectorize_loops)
+	&& !hsa_gpu_implementation_p (fun->decl);
     }
 
   virtual unsigned int execute (function *);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index b80a8dd..366138c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -75,6 +75,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "dbgcnt.h"
 #include "tree-scalar-evolution.h"
+#include "symbol-summary.h"
+#include "hsa.h"
 
 
 /* Loop or bb location.  */
@@ -675,7 +677,10 @@ public:
 
   /* opt_pass methods: */
   opt_pass * clone () { return new pass_slp_vectorize (m_ctxt); }
-  virtual bool gate (function *) { return flag_tree_slp_vectorize != 0; }
+  virtual bool gate (function *fun)
+  {
+    return flag_tree_slp_vectorize && !hsa_gpu_implementation_p (fun->decl);
+  }
   virtual unsigned int execute (function *);
 
 }; // class pass_slp_vectorize

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 8/12] Pass manager changes
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (6 preceding siblings ...)
  2015-11-05 22:01 ` [hsa 7/12] Disabling the vectorizer for GPU kernels/functions Martin Jambor
@ 2015-11-05 22:02 ` Martin Jambor
  2015-11-05 22:03 ` [hsa 9/12] Small alloc-pool fix Martin Jambor
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:02 UTC (permalink / raw)
  To: GCC Patches

Hi,

the following patch has actually already been committed to trunk by
Martin yesterday, but is necessary if you have an older trunk.  It
allows an optimization pass to declare the function finished by
returning TODO_discard_function.  The pass manager will then discard
the function.
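
For instance, a pass generating code for an accelerator can emit its
output and then have the gimple body thrown away.  A minimal sketch of
such an execute method (pass_mygen and emit_my_target_code are made-up
names, the usual pass boilerplate is omitted):

  unsigned int
  pass_mygen::execute (function *fun)
  {
    /* Emit target code for FUN here (HSAIL in the case of this series).  */
    emit_my_target_code (fun);

    /* Ask the pass manager to release FUN's body and to stop running
       further passes on it.  */
    return TODO_discard_function;
  }

This is how the HSAIL generation pass finishes HSA clones once their
HSAIL has been emitted.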

Martin


2015-11-04  Martin Liska  <mliska@suse.cz>

        * cgraphunit.c (cgraph_node::expand_thunk): Call 
        allocate_struct_function before init_function_start.
        (cgraph_node::expand): Use push_cfun and pop_cfun.
        * config/i386/i386.c (ix86_code_end): Call
        allocate_struct_function before init_function_start.
        * config/rs6000/rs6000.c (rs6000_code_end): Likewise.
        * function.c (init_function_start): Move preamble to all
        callers.
        * passes.c (do_per_function_toporder): Use push_cfun and pop_cfun.
        (execute_one_pass): Handle newly added TODO_discard_function.
        (execute_pass_list_1): Terminate if cfun equals to NULL.
        (execute_pass_list): Do not push and pop cfun, expect that
        cfun is set.
        * tree-pass.h (TODO_discard_function): Define.

diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index 43d3185..f73d9a7 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -1618,6 +1618,7 @@ cgraph_node::expand_thunk (bool output_asm_thunks, bool force_gimple_thunk)
       fn_block = make_node (BLOCK);
       BLOCK_VARS (fn_block) = a;
       DECL_INITIAL (thunk_fndecl) = fn_block;
+      allocate_struct_function (thunk_fndecl, false);
       init_function_start (thunk_fndecl);
       cfun->is_thunk = 1;
       insn_locations_init ();
@@ -1632,7 +1633,6 @@ cgraph_node::expand_thunk (bool output_asm_thunks, bool force_gimple_thunk)
       insn_locations_finalize ();
       init_insn_lengths ();
       free_after_compilation (cfun);
-      set_cfun (NULL);
       TREE_ASM_WRITTEN (thunk_fndecl) = 1;
       thunk.thunk_p = false;
       analyzed = false;
@@ -1944,9 +1944,11 @@ cgraph_node::expand (void)
   bitmap_obstack_initialize (NULL);
 
   /* Initialize the RTL code for the function.  */
-  current_function_decl = decl;
   saved_loc = input_location;
   input_location = DECL_SOURCE_LOCATION (decl);
+
+  gcc_assert (DECL_STRUCT_FUNCTION (decl));
+  push_cfun (DECL_STRUCT_FUNCTION (decl));
   init_function_start (decl);
 
   gimple_register_cfg_hooks ();
@@ -2014,8 +2016,8 @@ cgraph_node::expand (void)
 
   /* Make sure that BE didn't give up on compiling.  */
   gcc_assert (TREE_ASM_WRITTEN (decl));
-  set_cfun (NULL);
-  current_function_decl = NULL;
+  if (cfun)
+    pop_cfun ();
 
   /* It would make a lot more sense to output thunks before function body to get more
      forward and lest backwarding jumps.  This however would need solving problem
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 66024e2..2a965f6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -10958,6 +10958,7 @@ ix86_code_end (void)
 
       DECL_INITIAL (decl) = make_node (BLOCK);
       current_function_decl = decl;
+      allocate_struct_function (decl, false);
       init_function_start (decl);
       first_function_block_is_cold = false;
       /* Make sure unwind info is emitted for the thunk if needed.  */
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 271c3f9..8bdd646 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -34594,6 +34594,7 @@ rs6000_code_end (void)
 
   DECL_INITIAL (decl) = make_node (BLOCK);
   current_function_decl = decl;
+  allocate_struct_function (decl, false);
   init_function_start (decl);
   first_function_block_is_cold = false;
   /* Make sure unwind info is emitted for the thunk if needed.  */
diff --git a/gcc/function.c b/gcc/function.c
index aaf49a4..0d7cabc 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -4957,11 +4957,6 @@ init_dummy_function_start (void)
 void
 init_function_start (tree subr)
 {
-  if (subr && DECL_STRUCT_FUNCTION (subr))
-    set_cfun (DECL_STRUCT_FUNCTION (subr));
-  else
-    allocate_struct_function (subr, false);
-
   /* Initialize backend, if needed.  */
   initialize_rtl ();
 
diff --git a/gcc/passes.c b/gcc/passes.c
index f87dcf4..08221ed 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -1706,7 +1706,12 @@ do_per_function_toporder (void (*callback) (function *, void *data), void *data)
 	  order[i] = NULL;
 	  node->process = 0;
 	  if (node->has_gimple_body_p ())
-	    callback (DECL_STRUCT_FUNCTION (node->decl), data);
+	    {
+	      struct function *fn = DECL_STRUCT_FUNCTION (node->decl);
+	      push_cfun (fn);
+	      callback (fn, data);
+	      pop_cfun ();
+	    }
 	}
       symtab->remove_cgraph_removal_hook (hook);
     }
@@ -2347,6 +2352,23 @@ execute_one_pass (opt_pass *pass)
 
   current_pass = NULL;
 
+  if (todo_after & TODO_discard_function)
+    {
+      gcc_assert (cfun);
+      /* As cgraph_node::release_body expects release dominators info,
+	 we have to release it.  */
+      if (dom_info_available_p (CDI_DOMINATORS))
+	free_dominance_info (CDI_DOMINATORS);
+
+      if (dom_info_available_p (CDI_POST_DOMINATORS))
+	free_dominance_info (CDI_POST_DOMINATORS);
+
+      tree fn = cfun->decl;
+      pop_cfun ();
+      gcc_assert (!cfun);
+      cgraph_node::get (fn)->release_body ();
+    }
+
   /* Signal this is a suitable GC collection point.  */
   if (!((todo_after | pass->todo_flags_finish) & TODO_do_not_ggc_collect))
     ggc_collect ();
@@ -2361,8 +2383,12 @@ execute_pass_list_1 (opt_pass *pass)
     {
       gcc_assert (pass->type == GIMPLE_PASS
 		  || pass->type == RTL_PASS);
+
+      if (cfun == NULL)
+	return;
       if (execute_one_pass (pass) && pass->sub)
-        execute_pass_list_1 (pass->sub);
+	execute_pass_list_1 (pass->sub);
+
       pass = pass->next;
     }
   while (pass);
@@ -2371,14 +2397,13 @@ execute_pass_list_1 (opt_pass *pass)
 void
 execute_pass_list (function *fn, opt_pass *pass)
 {
-  push_cfun (fn);
+  gcc_assert (fn == cfun);
   execute_pass_list_1 (pass);
-  if (fn->cfg)
+  if (cfun && fn->cfg)
     {
       free_dominance_info (CDI_DOMINATORS);
       free_dominance_info (CDI_POST_DOMINATORS);
     }
-  pop_cfun ();
 }
 
 /* Write out all LTO data.  */
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..27b43df 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -125,6 +125,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_ipa_cp);
   NEXT_PASS (pass_ipa_cdtor_merge);
   NEXT_PASS (pass_target_clone);
+  NEXT_PASS (pass_ipa_hsa);
   NEXT_PASS (pass_ipa_inline);
   NEXT_PASS (pass_ipa_pure_const);
   NEXT_PASS (pass_ipa_reference);
@@ -357,6 +358,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_nrv);
   NEXT_PASS (pass_cleanup_cfg_post_optimizing);
   NEXT_PASS (pass_warn_function_noreturn);
+  NEXT_PASS (pass_gen_hsail);
 
   NEXT_PASS (pass_expand);
 
diff --git a/gcc/timevar.def b/gcc/timevar.def
index b429faf..6fdee0f 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -94,6 +94,7 @@ DEFTIMEVAR (TV_WHOPR_WPA_IO          , "whopr wpa I/O")
 DEFTIMEVAR (TV_WHOPR_PARTITIONING    , "whopr partitioning")
 DEFTIMEVAR (TV_WHOPR_LTRANS          , "whopr ltrans")
 DEFTIMEVAR (TV_IPA_REFERENCE         , "ipa reference")
+DEFTIMEVAR (TV_IPA_HSA		     , "ipa HSA")
 DEFTIMEVAR (TV_IPA_PROFILE           , "ipa profile")
 DEFTIMEVAR (TV_IPA_AUTOFDO           , "auto profile")
 DEFTIMEVAR (TV_IPA_PURE_CONST        , "ipa pure const")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index ba53cca..713f068 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -300,6 +300,9 @@ protected:
 /* Rebuild the callgraph edges.  */
 #define TODO_rebuild_cgraph_edges       (1 << 22)
 
+/* Release function body and stop pass manager.  */
+#define TODO_discard_function		(1 << 23)
+
 /* Internally used in execute_function_todo().  */
 #define TODO_update_ssa_any		\
     (TODO_update_ssa			\
@@ -460,6 +463,7 @@ extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_gen_hsail (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
@@ -484,6 +488,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cp (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_icf (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_hsa (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
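
For illustration, this is the protocol a pass follows to use the new
flag; a minimal sketch with a hypothetical pass class (not part of
the patch):

  /* Returning TODO_discard_function from execute makes the pass
     manager free the dominance info, pop and release the function
     body; because cfun then becomes NULL, execute_pass_list_1 stops
     running the remaining passes on this function.  */

  unsigned int
  pass_example::execute (function *)
  {
    /* ... consume cfun, e.g. translate it into another IR ...  */
    return TODO_discard_function;
  }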

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 9/12] Small alloc-pool fix
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (7 preceding siblings ...)
  2015-11-05 22:02 ` [hsa 8/12] Pass manager changes Martin Jambor
@ 2015-11-05 22:03 ` Martin Jambor
  2015-11-06  9:00   ` Richard Biener
  2015-11-05 22:05 ` [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue) Martin Jambor
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:03 UTC (permalink / raw)
  To: GCC Patches; +Cc: Martin Liska, Richard Biener

Hi,

we use C++ new operators based on alloc-pools a lot in the subsequent
patches and realized that on the current trunk, such new operators
would needlessly call the placement ::new operator a second time,
within the allocate method of alloc-pool.  This is fixed below by
providing a new allocation method which returns raw memory without
calling placement new, and which is therefore only safe to use from
within a new operator (the new operator then runs the single
placement new itself).

The patch also fixes the slightly weird two-parameter operator new
(which we do not use in the HSA back-end) so that it does not suffer
from the same problem.
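
To illustrate, this is the usage pattern the operators support; a
minimal sketch with a hypothetical type:

  struct edge_info
  {
    edge_info (int c) : m_cost (c) {}
    int m_cost;
  };

  static object_allocator<edge_info> einfo_pool ("hypothetical edge infos");

  /* The two-argument operator new only obtains raw memory through
     vallocate, so the placement new below runs the constructor
     exactly once.  */
  edge_info *ei = new (einfo_pool) edge_info (42);

  /* Hand the memory back to the pool when done.  */
  einfo_pool.remove (ei);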

Thanks,

Martin


2015-11-05  Martin Liska  <mliska@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	* alloc-pool.h (object_allocator::vallocate): New method.
	(operator new): Call vallocate instead of allocate.
	(operator new): New operator.


diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
index 0dc05cd..46b6550 100644
--- a/gcc/alloc-pool.h
+++ b/gcc/alloc-pool.h
@@ -483,6 +483,12 @@ public:
     return ::new (m_allocator.allocate ()) T ();
   }
 
+  inline void *
+  vallocate () ATTRIBUTE_MALLOC
+  {
+    return m_allocator.allocate ();
+  }
+
   inline void
   remove (T *object)
   {
@@ -523,12 +529,19 @@ struct alloc_pool_descriptor
 };
 
 /* Helper for classes that do not provide default ctor.  */
-
 template <typename T>
 inline void *
 operator new (size_t, object_allocator<T> &a)
 {
-  return a.allocate ();
+  return a.vallocate ();
+}
+
+/* Helper for classes that do not provide default ctor.  */
+template <typename T>
+inline void *
+operator new (size_t, object_allocator<T> *a)
+{
+  return a->vallocate ();
 }
 
 /* Hashtable mapping alloc_pool names to descriptors.  */

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue)
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (8 preceding siblings ...)
  2015-11-05 22:03 ` [hsa 9/12] Small alloc-pool fix Martin Jambor
@ 2015-11-05 22:05 ` Martin Jambor
  2015-11-06 11:29   ` Bernd Schmidt
  2015-11-05 22:06 ` [hsa 11/12] Majority of the HSA back-end Martin Jambor
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:05 UTC (permalink / raw)
  To: GCC Patches

Hi,

the following patch adds a BRIG (binary representation of HSAIL)
representation description.  It is within a single header file
describing the binary structures and constants of the format.
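
For orientation: entries in the BRIG code and operand sections are
self-describing, each starting with a BrigBase that carries its byte
count and kind.  A minimal sketch of walking a section (assuming the
entry data starts right after the section header, and ignoring the
little-endian conversions a portable reader would need):

  static void
  walk_brig_entries (const char *data, size_t size)
  {
    size_t offset = 0;
    while (offset < size)
      {
        const BrigBase *e = (const BrigBase *) (data + offset);
        /* e->kind identifies the entry, e.g. BRIG_KIND_DIRECTIVE_KERNEL.  */
        offset += e->byteCount;
      }
  }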


Initially, I created the file by copying out pieces of the PDF
documentation, but the latest version of the file (describing the
final HSAIL 1.0) is actually taken from the HSAIL (dis)assembler
developed by the HSA Foundation and released under the University of
Illinois/NCSA Open Source License.

The license is "GPL-compatible" according to the FSF
(http://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses),
so I believe that means we can put it inside GCC, and I hope I also do
not need any special steering committee approval or whatnot.  At the
same time, the license comes with three restrictions that I hope I
have fulfilled by keeping them in the header comment.  Nevertheless,
if anybody knowledgeable can tell me what the right thing to do is
(or confirm that this is indeed the right thing to do), I'll be very
happy.

Because this file originated outside GCC and we are likely to replace
it with newer versions as they come along, it does not follow the GNU
coding standards.  I hope that is reasonable.

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>

        * hsa-brig-format.h: New file.

diff --git a/gcc/hsa-brig-format.h b/gcc/hsa-brig-format.h
new file mode 100644
index 0000000..f099fc6
--- /dev/null
+++ b/gcc/hsa-brig-format.h
@@ -0,0 +1,1283 @@
+/* HSAIL and BRIG related macros and definitions.
+   Copyright (c) 2013-2015, Advanced Micro Devices, Inc.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+
+   Majority of contents in this file has originally been distributed under the
+   University of Illinois/NCSA Open Source License.  This license mandates that
+   the following conditions are observed when distributing this file:
+
+     * Redistributions of source code must retain the above copyright notice,
+       this list of conditions and the following disclaimers.
+
+     * Redistributions in binary form must reproduce the above copyright notice,
+       this list of conditions and the following disclaimers in the
+       documentation and/or other materials provided with the distribution.
+
+     * Neither the names of the HSA Team, HSA Foundation, University of
+       Illinois at Urbana-Champaign, nor the names of its contributors may be
+       used to endorse or promote products derived from this Software without
+       specific prior written permission.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+   CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+   WITH THE SOFTWARE.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef HSA_BRIG_FORMAT_H
+#define HSA_BRIG_FORMAT_H
+
+#include "config.h"
+#include "system.h"
+
+typedef uint32_t BrigVersion32_t;
+
+enum BrigVersion {
+
+    BRIG_VERSION_HSAIL_MAJOR = 1,
+    BRIG_VERSION_HSAIL_MINOR = 0,
+    BRIG_VERSION_BRIG_MAJOR  = 1,
+    BRIG_VERSION_BRIG_MINOR  = 0
+};
+
+typedef uint8_t BrigAlignment8_t;
+
+typedef uint8_t BrigAllocation8_t;
+
+typedef uint8_t BrigAluModifier8_t;
+
+typedef uint8_t BrigAtomicOperation8_t;
+
+typedef uint32_t BrigCodeOffset32_t;
+
+typedef uint8_t BrigCompareOperation8_t;
+
+typedef uint16_t BrigControlDirective16_t;
+
+typedef uint32_t BrigDataOffset32_t;
+
+typedef BrigDataOffset32_t BrigDataOffsetCodeList32_t;
+
+typedef BrigDataOffset32_t BrigDataOffsetOperandList32_t;
+
+typedef BrigDataOffset32_t BrigDataOffsetString32_t;
+
+typedef uint8_t BrigExecutableModifier8_t;
+
+typedef uint8_t BrigImageChannelOrder8_t;
+
+typedef uint8_t BrigImageChannelType8_t;
+
+typedef uint8_t BrigImageGeometry8_t;
+
+typedef uint8_t BrigImageQuery8_t;
+
+typedef uint16_t BrigKind16_t;
+
+typedef uint8_t BrigLinkage8_t;
+
+typedef uint8_t BrigMachineModel8_t;
+
+typedef uint8_t BrigMemoryModifier8_t;
+
+typedef uint8_t BrigMemoryOrder8_t;
+
+typedef uint8_t BrigMemoryScope8_t;
+
+typedef uint16_t BrigOpcode16_t;
+
+typedef uint32_t BrigOperandOffset32_t;
+
+typedef uint8_t BrigPack8_t;
+
+typedef uint8_t BrigProfile8_t;
+
+typedef uint16_t BrigRegisterKind16_t;
+
+typedef uint8_t BrigRound8_t;
+
+typedef uint8_t BrigSamplerAddressing8_t;
+
+typedef uint8_t BrigSamplerCoordNormalization8_t;
+
+typedef uint8_t BrigSamplerFilter8_t;
+
+typedef uint8_t BrigSamplerQuery8_t;
+
+typedef uint32_t BrigSectionIndex32_t;
+
+typedef uint8_t BrigSegCvtModifier8_t;
+
+typedef uint8_t BrigSegment8_t;
+
+typedef uint32_t BrigStringOffset32_t;
+
+typedef uint16_t BrigType16_t;
+
+typedef uint8_t BrigVariableModifier8_t;
+
+typedef uint8_t BrigWidth8_t;
+
+typedef uint32_t BrigExceptions32_t;
+
+enum BrigKind {
+
+    BRIG_KIND_NONE = 0x0000,
+
+    BRIG_KIND_DIRECTIVE_BEGIN = 0x1000,
+    BRIG_KIND_DIRECTIVE_ARG_BLOCK_END = 0x1000,
+    BRIG_KIND_DIRECTIVE_ARG_BLOCK_START = 0x1001,
+    BRIG_KIND_DIRECTIVE_COMMENT = 0x1002,
+    BRIG_KIND_DIRECTIVE_CONTROL = 0x1003,
+    BRIG_KIND_DIRECTIVE_EXTENSION = 0x1004,
+    BRIG_KIND_DIRECTIVE_FBARRIER = 0x1005,
+    BRIG_KIND_DIRECTIVE_FUNCTION = 0x1006,
+    BRIG_KIND_DIRECTIVE_INDIRECT_FUNCTION = 0x1007,
+    BRIG_KIND_DIRECTIVE_KERNEL = 0x1008,
+    BRIG_KIND_DIRECTIVE_LABEL = 0x1009,
+    BRIG_KIND_DIRECTIVE_LOC = 0x100a,
+    BRIG_KIND_DIRECTIVE_MODULE = 0x100b,
+    BRIG_KIND_DIRECTIVE_PRAGMA = 0x100c,
+    BRIG_KIND_DIRECTIVE_SIGNATURE = 0x100d,
+    BRIG_KIND_DIRECTIVE_VARIABLE = 0x100e,
+    BRIG_KIND_DIRECTIVE_END = 0x100f,
+
+    BRIG_KIND_INST_BEGIN = 0x2000,
+    BRIG_KIND_INST_ADDR = 0x2000,
+    BRIG_KIND_INST_ATOMIC = 0x2001,
+    BRIG_KIND_INST_BASIC = 0x2002,
+    BRIG_KIND_INST_BR = 0x2003,
+    BRIG_KIND_INST_CMP = 0x2004,
+    BRIG_KIND_INST_CVT = 0x2005,
+    BRIG_KIND_INST_IMAGE = 0x2006,
+    BRIG_KIND_INST_LANE = 0x2007,
+    BRIG_KIND_INST_MEM = 0x2008,
+    BRIG_KIND_INST_MEM_FENCE = 0x2009,
+    BRIG_KIND_INST_MOD = 0x200a,
+    BRIG_KIND_INST_QUERY_IMAGE = 0x200b,
+    BRIG_KIND_INST_QUERY_SAMPLER = 0x200c,
+    BRIG_KIND_INST_QUEUE = 0x200d,
+    BRIG_KIND_INST_SEG = 0x200e,
+    BRIG_KIND_INST_SEG_CVT = 0x200f,
+    BRIG_KIND_INST_SIGNAL = 0x2010,
+    BRIG_KIND_INST_SOURCE_TYPE = 0x2011,
+    BRIG_KIND_INST_END = 0x2012,
+
+    BRIG_KIND_OPERAND_BEGIN = 0x3000,
+    BRIG_KIND_OPERAND_ADDRESS = 0x3000,
+    BRIG_KIND_OPERAND_ALIGN = 0x3001,
+    BRIG_KIND_OPERAND_CODE_LIST = 0x3002,
+    BRIG_KIND_OPERAND_CODE_REF = 0x3003,
+    BRIG_KIND_OPERAND_CONSTANT_BYTES = 0x3004,
+    BRIG_KIND_OPERAND_RESERVED = 0x3005,
+    BRIG_KIND_OPERAND_CONSTANT_IMAGE = 0x3006,
+    BRIG_KIND_OPERAND_CONSTANT_OPERAND_LIST = 0x3007,
+    BRIG_KIND_OPERAND_CONSTANT_SAMPLER = 0x3008,
+    BRIG_KIND_OPERAND_OPERAND_LIST = 0x3009,
+    BRIG_KIND_OPERAND_REGISTER = 0x300a,
+    BRIG_KIND_OPERAND_STRING = 0x300b,
+    BRIG_KIND_OPERAND_WAVESIZE = 0x300c,
+    BRIG_KIND_OPERAND_END = 0x300d
+};
+
+enum BrigAlignment {
+
+    BRIG_ALIGNMENT_NONE = 0,
+    BRIG_ALIGNMENT_1 = 1,
+    BRIG_ALIGNMENT_2 = 2,
+    BRIG_ALIGNMENT_4 = 3,
+    BRIG_ALIGNMENT_8 = 4,
+    BRIG_ALIGNMENT_16 = 5,
+    BRIG_ALIGNMENT_32 = 6,
+    BRIG_ALIGNMENT_64 = 7,
+    BRIG_ALIGNMENT_128 = 8,
+    BRIG_ALIGNMENT_256 = 9,
+
+    BRIG_ALIGNMENT_LAST,
+    BRIG_ALIGNMENT_MAX = BRIG_ALIGNMENT_LAST - 1
+};
+
+enum BrigAllocation {
+
+    BRIG_ALLOCATION_NONE = 0,
+    BRIG_ALLOCATION_PROGRAM = 1,
+    BRIG_ALLOCATION_AGENT = 2,
+    BRIG_ALLOCATION_AUTOMATIC = 3
+};
+
+enum BrigAluModifierMask {
+    BRIG_ALU_FTZ = 1
+};
+
+enum BrigAtomicOperation {
+
+    BRIG_ATOMIC_ADD = 0,
+    BRIG_ATOMIC_AND = 1,
+    BRIG_ATOMIC_CAS = 2,
+    BRIG_ATOMIC_EXCH = 3,
+    BRIG_ATOMIC_LD = 4,
+    BRIG_ATOMIC_MAX = 5,
+    BRIG_ATOMIC_MIN = 6,
+    BRIG_ATOMIC_OR = 7,
+    BRIG_ATOMIC_ST = 8,
+    BRIG_ATOMIC_SUB = 9,
+    BRIG_ATOMIC_WRAPDEC = 10,
+    BRIG_ATOMIC_WRAPINC = 11,
+    BRIG_ATOMIC_XOR = 12,
+    BRIG_ATOMIC_WAIT_EQ = 13,
+    BRIG_ATOMIC_WAIT_NE = 14,
+    BRIG_ATOMIC_WAIT_LT = 15,
+    BRIG_ATOMIC_WAIT_GTE = 16,
+    BRIG_ATOMIC_WAITTIMEOUT_EQ = 17,
+    BRIG_ATOMIC_WAITTIMEOUT_NE = 18,
+    BRIG_ATOMIC_WAITTIMEOUT_LT = 19,
+    BRIG_ATOMIC_WAITTIMEOUT_GTE = 20
+};
+
+enum BrigCompareOperation {
+
+    BRIG_COMPARE_EQ = 0,
+    BRIG_COMPARE_NE = 1,
+    BRIG_COMPARE_LT = 2,
+    BRIG_COMPARE_LE = 3,
+    BRIG_COMPARE_GT = 4,
+    BRIG_COMPARE_GE = 5,
+    BRIG_COMPARE_EQU = 6,
+    BRIG_COMPARE_NEU = 7,
+    BRIG_COMPARE_LTU = 8,
+    BRIG_COMPARE_LEU = 9,
+    BRIG_COMPARE_GTU = 10,
+    BRIG_COMPARE_GEU = 11,
+    BRIG_COMPARE_NUM = 12,
+    BRIG_COMPARE_NAN = 13,
+    BRIG_COMPARE_SEQ = 14,
+    BRIG_COMPARE_SNE = 15,
+    BRIG_COMPARE_SLT = 16,
+    BRIG_COMPARE_SLE = 17,
+    BRIG_COMPARE_SGT = 18,
+    BRIG_COMPARE_SGE = 19,
+    BRIG_COMPARE_SGEU = 20,
+    BRIG_COMPARE_SEQU = 21,
+    BRIG_COMPARE_SNEU = 22,
+    BRIG_COMPARE_SLTU = 23,
+    BRIG_COMPARE_SLEU = 24,
+    BRIG_COMPARE_SNUM = 25,
+    BRIG_COMPARE_SNAN = 26,
+    BRIG_COMPARE_SGTU = 27
+};
+
+enum BrigControlDirective {
+
+    BRIG_CONTROL_NONE = 0,
+    BRIG_CONTROL_ENABLEBREAKEXCEPTIONS = 1,
+    BRIG_CONTROL_ENABLEDETECTEXCEPTIONS = 2,
+    BRIG_CONTROL_MAXDYNAMICGROUPSIZE = 3,
+    BRIG_CONTROL_MAXFLATGRIDSIZE = 4,
+    BRIG_CONTROL_MAXFLATWORKGROUPSIZE = 5,
+    BRIG_CONTROL_REQUIREDDIM = 6,
+    BRIG_CONTROL_REQUIREDGRIDSIZE = 7,
+    BRIG_CONTROL_REQUIREDWORKGROUPSIZE = 8,
+    BRIG_CONTROL_REQUIRENOPARTIALWORKGROUPS = 9
+};
+
+enum BrigExecutableModifierMask {
+
+    BRIG_EXECUTABLE_DEFINITION = 1
+};
+
+enum BrigImageChannelOrder {
+
+    BRIG_CHANNEL_ORDER_A = 0,
+    BRIG_CHANNEL_ORDER_R = 1,
+    BRIG_CHANNEL_ORDER_RX = 2,
+    BRIG_CHANNEL_ORDER_RG = 3,
+    BRIG_CHANNEL_ORDER_RGX = 4,
+    BRIG_CHANNEL_ORDER_RA = 5,
+    BRIG_CHANNEL_ORDER_RGB = 6,
+    BRIG_CHANNEL_ORDER_RGBX = 7,
+    BRIG_CHANNEL_ORDER_RGBA = 8,
+    BRIG_CHANNEL_ORDER_BGRA = 9,
+    BRIG_CHANNEL_ORDER_ARGB = 10,
+    BRIG_CHANNEL_ORDER_ABGR = 11,
+    BRIG_CHANNEL_ORDER_SRGB = 12,
+    BRIG_CHANNEL_ORDER_SRGBX = 13,
+    BRIG_CHANNEL_ORDER_SRGBA = 14,
+    BRIG_CHANNEL_ORDER_SBGRA = 15,
+    BRIG_CHANNEL_ORDER_INTENSITY = 16,
+    BRIG_CHANNEL_ORDER_LUMINANCE = 17,
+    BRIG_CHANNEL_ORDER_DEPTH = 18,
+    BRIG_CHANNEL_ORDER_DEPTH_STENCIL = 19,
+
+    BRIG_CHANNEL_ORDER_UNKNOWN,
+
+    BRIG_CHANNEL_ORDER_FIRST_USER_DEFINED = 128
+
+};
+
+enum BrigImageChannelType {
+
+    BRIG_CHANNEL_TYPE_SNORM_INT8 = 0,
+    BRIG_CHANNEL_TYPE_SNORM_INT16 = 1,
+    BRIG_CHANNEL_TYPE_UNORM_INT8 = 2,
+    BRIG_CHANNEL_TYPE_UNORM_INT16 = 3,
+    BRIG_CHANNEL_TYPE_UNORM_INT24 = 4,
+    BRIG_CHANNEL_TYPE_UNORM_SHORT_555 = 5,
+    BRIG_CHANNEL_TYPE_UNORM_SHORT_565 = 6,
+    BRIG_CHANNEL_TYPE_UNORM_INT_101010 = 7,
+    BRIG_CHANNEL_TYPE_SIGNED_INT8 = 8,
+    BRIG_CHANNEL_TYPE_SIGNED_INT16 = 9,
+    BRIG_CHANNEL_TYPE_SIGNED_INT32 = 10,
+    BRIG_CHANNEL_TYPE_UNSIGNED_INT8 = 11,
+    BRIG_CHANNEL_TYPE_UNSIGNED_INT16 = 12,
+    BRIG_CHANNEL_TYPE_UNSIGNED_INT32 = 13,
+    BRIG_CHANNEL_TYPE_HALF_FLOAT = 14,
+    BRIG_CHANNEL_TYPE_FLOAT = 15,
+
+    BRIG_CHANNEL_TYPE_UNKNOWN,
+
+    BRIG_CHANNEL_TYPE_FIRST_USER_DEFINED = 128
+};
+
+enum BrigImageGeometry {
+
+    BRIG_GEOMETRY_1D = 0,
+    BRIG_GEOMETRY_2D = 1,
+    BRIG_GEOMETRY_3D = 2,
+    BRIG_GEOMETRY_1DA = 3,
+    BRIG_GEOMETRY_2DA = 4,
+    BRIG_GEOMETRY_1DB = 5,
+    BRIG_GEOMETRY_2DDEPTH = 6,
+    BRIG_GEOMETRY_2DADEPTH = 7,
+
+    BRIG_GEOMETRY_UNKNOWN,
+
+    BRIG_GEOMETRY_FIRST_USER_DEFINED = 128
+};
+
+enum BrigImageQuery {
+
+    BRIG_IMAGE_QUERY_WIDTH = 0,
+    BRIG_IMAGE_QUERY_HEIGHT = 1,
+    BRIG_IMAGE_QUERY_DEPTH = 2,
+    BRIG_IMAGE_QUERY_ARRAY = 3,
+    BRIG_IMAGE_QUERY_CHANNELORDER = 4,
+    BRIG_IMAGE_QUERY_CHANNELTYPE = 5
+};
+
+enum BrigLinkage {
+
+    BRIG_LINKAGE_NONE = 0,
+    BRIG_LINKAGE_PROGRAM = 1,
+    BRIG_LINKAGE_MODULE = 2,
+    BRIG_LINKAGE_FUNCTION = 3,
+    BRIG_LINKAGE_ARG = 4
+};
+
+enum BrigMachineModel {
+
+    BRIG_MACHINE_SMALL = 0,
+    BRIG_MACHINE_LARGE = 1,
+
+    BRIG_MACHINE_UNDEF = 2
+};
+
+enum BrigMemoryModifierMask {
+    BRIG_MEMORY_CONST = 1
+};
+
+enum BrigMemoryOrder {
+
+    BRIG_MEMORY_ORDER_NONE = 0,
+    BRIG_MEMORY_ORDER_RELAXED = 1,
+    BRIG_MEMORY_ORDER_SC_ACQUIRE = 2,
+    BRIG_MEMORY_ORDER_SC_RELEASE = 3,
+    BRIG_MEMORY_ORDER_SC_ACQUIRE_RELEASE = 4,
+
+    BRIG_MEMORY_ORDER_LAST = 5
+};
+
+enum BrigMemoryScope {
+
+    BRIG_MEMORY_SCOPE_NONE = 0,
+    BRIG_MEMORY_SCOPE_WORKITEM = 1,
+    BRIG_MEMORY_SCOPE_WAVEFRONT = 2,
+    BRIG_MEMORY_SCOPE_WORKGROUP = 3,
+    BRIG_MEMORY_SCOPE_AGENT = 4,
+    BRIG_MEMORY_SCOPE_SYSTEM = 5,
+
+    BRIG_MEMORY_SCOPE_LAST = 6
+};
+
+enum BrigOpcode {
+
+    BRIG_OPCODE_NOP = 0,
+    BRIG_OPCODE_ABS = 1,
+    BRIG_OPCODE_ADD = 2,
+    BRIG_OPCODE_BORROW = 3,
+    BRIG_OPCODE_CARRY = 4,
+    BRIG_OPCODE_CEIL = 5,
+    BRIG_OPCODE_COPYSIGN = 6,
+    BRIG_OPCODE_DIV = 7,
+    BRIG_OPCODE_FLOOR = 8,
+    BRIG_OPCODE_FMA = 9,
+    BRIG_OPCODE_FRACT = 10,
+    BRIG_OPCODE_MAD = 11,
+    BRIG_OPCODE_MAX = 12,
+    BRIG_OPCODE_MIN = 13,
+    BRIG_OPCODE_MUL = 14,
+    BRIG_OPCODE_MULHI = 15,
+    BRIG_OPCODE_NEG = 16,
+    BRIG_OPCODE_REM = 17,
+    BRIG_OPCODE_RINT = 18,
+    BRIG_OPCODE_SQRT = 19,
+    BRIG_OPCODE_SUB = 20,
+    BRIG_OPCODE_TRUNC = 21,
+    BRIG_OPCODE_MAD24 = 22,
+    BRIG_OPCODE_MAD24HI = 23,
+    BRIG_OPCODE_MUL24 = 24,
+    BRIG_OPCODE_MUL24HI = 25,
+    BRIG_OPCODE_SHL = 26,
+    BRIG_OPCODE_SHR = 27,
+    BRIG_OPCODE_AND = 28,
+    BRIG_OPCODE_NOT = 29,
+    BRIG_OPCODE_OR = 30,
+    BRIG_OPCODE_POPCOUNT = 31,
+    BRIG_OPCODE_XOR = 32,
+    BRIG_OPCODE_BITEXTRACT = 33,
+    BRIG_OPCODE_BITINSERT = 34,
+    BRIG_OPCODE_BITMASK = 35,
+    BRIG_OPCODE_BITREV = 36,
+    BRIG_OPCODE_BITSELECT = 37,
+    BRIG_OPCODE_FIRSTBIT = 38,
+    BRIG_OPCODE_LASTBIT = 39,
+    BRIG_OPCODE_COMBINE = 40,
+    BRIG_OPCODE_EXPAND = 41,
+    BRIG_OPCODE_LDA = 42,
+    BRIG_OPCODE_MOV = 43,
+    BRIG_OPCODE_SHUFFLE = 44,
+    BRIG_OPCODE_UNPACKHI = 45,
+    BRIG_OPCODE_UNPACKLO = 46,
+    BRIG_OPCODE_PACK = 47,
+    BRIG_OPCODE_UNPACK = 48,
+    BRIG_OPCODE_CMOV = 49,
+    BRIG_OPCODE_CLASS = 50,
+    BRIG_OPCODE_NCOS = 51,
+    BRIG_OPCODE_NEXP2 = 52,
+    BRIG_OPCODE_NFMA = 53,
+    BRIG_OPCODE_NLOG2 = 54,
+    BRIG_OPCODE_NRCP = 55,
+    BRIG_OPCODE_NRSQRT = 56,
+    BRIG_OPCODE_NSIN = 57,
+    BRIG_OPCODE_NSQRT = 58,
+    BRIG_OPCODE_BITALIGN = 59,
+    BRIG_OPCODE_BYTEALIGN = 60,
+    BRIG_OPCODE_PACKCVT = 61,
+    BRIG_OPCODE_UNPACKCVT = 62,
+    BRIG_OPCODE_LERP = 63,
+    BRIG_OPCODE_SAD = 64,
+    BRIG_OPCODE_SADHI = 65,
+    BRIG_OPCODE_SEGMENTP = 66,
+    BRIG_OPCODE_FTOS = 67,
+    BRIG_OPCODE_STOF = 68,
+    BRIG_OPCODE_CMP = 69,
+    BRIG_OPCODE_CVT = 70,
+    BRIG_OPCODE_LD = 71,
+    BRIG_OPCODE_ST = 72,
+    BRIG_OPCODE_ATOMIC = 73,
+    BRIG_OPCODE_ATOMICNORET = 74,
+    BRIG_OPCODE_SIGNAL = 75,
+    BRIG_OPCODE_SIGNALNORET = 76,
+    BRIG_OPCODE_MEMFENCE = 77,
+    BRIG_OPCODE_RDIMAGE = 78,
+    BRIG_OPCODE_LDIMAGE = 79,
+    BRIG_OPCODE_STIMAGE = 80,
+    BRIG_OPCODE_IMAGEFENCE = 81,
+    BRIG_OPCODE_QUERYIMAGE = 82,
+    BRIG_OPCODE_QUERYSAMPLER = 83,
+    BRIG_OPCODE_CBR = 84,
+    BRIG_OPCODE_BR = 85,
+    BRIG_OPCODE_SBR = 86,
+    BRIG_OPCODE_BARRIER = 87,
+    BRIG_OPCODE_WAVEBARRIER = 88,
+    BRIG_OPCODE_ARRIVEFBAR = 89,
+    BRIG_OPCODE_INITFBAR = 90,
+    BRIG_OPCODE_JOINFBAR = 91,
+    BRIG_OPCODE_LEAVEFBAR = 92,
+    BRIG_OPCODE_RELEASEFBAR = 93,
+    BRIG_OPCODE_WAITFBAR = 94,
+    BRIG_OPCODE_LDF = 95,
+    BRIG_OPCODE_ACTIVELANECOUNT = 96,
+    BRIG_OPCODE_ACTIVELANEID = 97,
+    BRIG_OPCODE_ACTIVELANEMASK = 98,
+    BRIG_OPCODE_ACTIVELANEPERMUTE = 99,
+    BRIG_OPCODE_CALL = 100,
+    BRIG_OPCODE_SCALL = 101,
+    BRIG_OPCODE_ICALL = 102,
+    BRIG_OPCODE_RET = 103,
+    BRIG_OPCODE_ALLOCA = 104,
+    BRIG_OPCODE_CURRENTWORKGROUPSIZE = 105,
+    BRIG_OPCODE_CURRENTWORKITEMFLATID = 106,
+    BRIG_OPCODE_DIM = 107,
+    BRIG_OPCODE_GRIDGROUPS = 108,
+    BRIG_OPCODE_GRIDSIZE = 109,
+    BRIG_OPCODE_PACKETCOMPLETIONSIG = 110,
+    BRIG_OPCODE_PACKETID = 111,
+    BRIG_OPCODE_WORKGROUPID = 112,
+    BRIG_OPCODE_WORKGROUPSIZE = 113,
+    BRIG_OPCODE_WORKITEMABSID = 114,
+    BRIG_OPCODE_WORKITEMFLATABSID = 115,
+    BRIG_OPCODE_WORKITEMFLATID = 116,
+    BRIG_OPCODE_WORKITEMID = 117,
+    BRIG_OPCODE_CLEARDETECTEXCEPT = 118,
+    BRIG_OPCODE_GETDETECTEXCEPT = 119,
+    BRIG_OPCODE_SETDETECTEXCEPT = 120,
+    BRIG_OPCODE_ADDQUEUEWRITEINDEX = 121,
+    BRIG_OPCODE_CASQUEUEWRITEINDEX = 122,
+    BRIG_OPCODE_LDQUEUEREADINDEX = 123,
+    BRIG_OPCODE_LDQUEUEWRITEINDEX = 124,
+    BRIG_OPCODE_STQUEUEREADINDEX = 125,
+    BRIG_OPCODE_STQUEUEWRITEINDEX = 126,
+    BRIG_OPCODE_CLOCK = 127,
+    BRIG_OPCODE_CUID = 128,
+    BRIG_OPCODE_DEBUGTRAP = 129,
+    BRIG_OPCODE_GROUPBASEPTR = 130,
+    BRIG_OPCODE_KERNARGBASEPTR = 131,
+    BRIG_OPCODE_LANEID = 132,
+    BRIG_OPCODE_MAXCUID = 133,
+    BRIG_OPCODE_MAXWAVEID = 134,
+    BRIG_OPCODE_NULLPTR = 135,
+    BRIG_OPCODE_WAVEID = 136,
+    BRIG_OPCODE_FIRST_USER_DEFINED = 32768,
+
+    BRIG_OPCODE_GCNMADU = (1u << 15) | 0,
+    BRIG_OPCODE_GCNMADS = (1u << 15) | 1,
+    BRIG_OPCODE_GCNMAX3 = (1u << 15) | 2,
+    BRIG_OPCODE_GCNMIN3 = (1u << 15) | 3,
+    BRIG_OPCODE_GCNMED3 = (1u << 15) | 4,
+    BRIG_OPCODE_GCNFLDEXP = (1u << 15) | 5,
+    BRIG_OPCODE_GCNFREXP_EXP = (1u << 15) | 6,
+    BRIG_OPCODE_GCNFREXP_MANT = (1u << 15) | 7,
+    BRIG_OPCODE_GCNTRIG_PREOP = (1u << 15) | 8,
+    BRIG_OPCODE_GCNBFM = (1u << 15) | 9,
+    BRIG_OPCODE_GCNLD = (1u << 15) | 10,
+    BRIG_OPCODE_GCNST = (1u << 15) | 11,
+    BRIG_OPCODE_GCNATOMIC = (1u << 15) | 12,
+    BRIG_OPCODE_GCNATOMICNORET = (1u << 15) | 13,
+    BRIG_OPCODE_GCNSLEEP = (1u << 15) | 14,
+    BRIG_OPCODE_GCNPRIORITY = (1u << 15) | 15,
+    BRIG_OPCODE_GCNREGIONALLOC = (1u << 15) | 16,
+    BRIG_OPCODE_GCNMSAD = (1u << 15) | 17,
+    BRIG_OPCODE_GCNQSAD = (1u << 15) | 18,
+    BRIG_OPCODE_GCNMQSAD = (1u << 15) | 19,
+    BRIG_OPCODE_GCNMQSAD4 = (1u << 15) | 20,
+    BRIG_OPCODE_GCNSADW = (1u << 15) | 21,
+    BRIG_OPCODE_GCNSADD = (1u << 15) | 22,
+    BRIG_OPCODE_GCNCONSUME = (1u << 15) | 23,
+    BRIG_OPCODE_GCNAPPEND = (1u << 15) | 24,
+    BRIG_OPCODE_GCNB4XCHG = (1u << 15) | 25,
+    BRIG_OPCODE_GCNB32XCHG = (1u << 15) | 26,
+    BRIG_OPCODE_GCNMAX = (1u << 15) | 27,
+    BRIG_OPCODE_GCNMIN = (1u << 15) | 28,
+    BRIG_OPCODE_GCNDIVRELAXED = (1u << 15) | 29,
+    BRIG_OPCODE_GCNDIVRELAXEDNARROW = (1u << 15) | 30
+};
+
+enum BrigPack {
+
+    BRIG_PACK_NONE = 0,
+    BRIG_PACK_PP = 1,
+    BRIG_PACK_PS = 2,
+    BRIG_PACK_SP = 3,
+    BRIG_PACK_SS = 4,
+    BRIG_PACK_S = 5,
+    BRIG_PACK_P = 6,
+    BRIG_PACK_PPSAT = 7,
+    BRIG_PACK_PSSAT = 8,
+    BRIG_PACK_SPSAT = 9,
+    BRIG_PACK_SSSAT = 10,
+    BRIG_PACK_SSAT = 11,
+    BRIG_PACK_PSAT = 12
+};
+
+enum BrigProfile {
+
+    BRIG_PROFILE_BASE = 0,
+    BRIG_PROFILE_FULL = 1,
+
+    BRIG_PROFILE_UNDEF = 2
+};
+
+enum BrigRegisterKind {
+
+    BRIG_REGISTER_KIND_CONTROL = 0,
+    BRIG_REGISTER_KIND_SINGLE = 1,
+    BRIG_REGISTER_KIND_DOUBLE = 2,
+    BRIG_REGISTER_KIND_QUAD = 3
+};
+
+enum BrigRound {
+
+    BRIG_ROUND_NONE = 0,
+    BRIG_ROUND_FLOAT_DEFAULT = 1,
+    BRIG_ROUND_FLOAT_NEAR_EVEN = 2,
+    BRIG_ROUND_FLOAT_ZERO = 3,
+    BRIG_ROUND_FLOAT_PLUS_INFINITY = 4,
+    BRIG_ROUND_FLOAT_MINUS_INFINITY = 5,
+    BRIG_ROUND_INTEGER_NEAR_EVEN = 6,
+    BRIG_ROUND_INTEGER_ZERO = 7,
+    BRIG_ROUND_INTEGER_PLUS_INFINITY = 8,
+    BRIG_ROUND_INTEGER_MINUS_INFINITY = 9,
+    BRIG_ROUND_INTEGER_NEAR_EVEN_SAT = 10,
+    BRIG_ROUND_INTEGER_ZERO_SAT = 11,
+    BRIG_ROUND_INTEGER_PLUS_INFINITY_SAT = 12,
+    BRIG_ROUND_INTEGER_MINUS_INFINITY_SAT = 13,
+    BRIG_ROUND_INTEGER_SIGNALING_NEAR_EVEN = 14,
+    BRIG_ROUND_INTEGER_SIGNALING_ZERO = 15,
+    BRIG_ROUND_INTEGER_SIGNALING_PLUS_INFINITY = 16,
+    BRIG_ROUND_INTEGER_SIGNALING_MINUS_INFINITY = 17,
+    BRIG_ROUND_INTEGER_SIGNALING_NEAR_EVEN_SAT = 18,
+    BRIG_ROUND_INTEGER_SIGNALING_ZERO_SAT = 19,
+    BRIG_ROUND_INTEGER_SIGNALING_PLUS_INFINITY_SAT = 20,
+    BRIG_ROUND_INTEGER_SIGNALING_MINUS_INFINITY_SAT = 21
+};
+
+enum BrigSamplerAddressing {
+
+    BRIG_ADDRESSING_UNDEFINED = 0,
+    BRIG_ADDRESSING_CLAMP_TO_EDGE = 1,
+    BRIG_ADDRESSING_CLAMP_TO_BORDER = 2,
+    BRIG_ADDRESSING_REPEAT = 3,
+    BRIG_ADDRESSING_MIRRORED_REPEAT = 4,
+
+    BRIG_ADDRESSING_FIRST_USER_DEFINED = 128
+};
+
+enum BrigSamplerCoordNormalization {
+
+    BRIG_COORD_UNNORMALIZED = 0,
+    BRIG_COORD_NORMALIZED = 1
+};
+
+enum BrigSamplerFilter {
+
+    BRIG_FILTER_NEAREST = 0,
+    BRIG_FILTER_LINEAR = 1,
+
+    BRIG_FILTER_FIRST_USER_DEFINED = 128
+};
+
+enum BrigSamplerQuery {
+
+    BRIG_SAMPLER_QUERY_ADDRESSING = 0,
+    BRIG_SAMPLER_QUERY_COORD = 1,
+    BRIG_SAMPLER_QUERY_FILTER = 2
+};
+
+enum BrigSectionIndex {
+
+    BRIG_SECTION_INDEX_DATA = 0,
+    BRIG_SECTION_INDEX_CODE = 1,
+    BRIG_SECTION_INDEX_OPERAND = 2,
+    BRIG_SECTION_INDEX_BEGIN_IMPLEMENTATION_DEFINED = 3,
+
+    BRIG_SECTION_INDEX_IMPLEMENTATION_DEFINED = BRIG_SECTION_INDEX_BEGIN_IMPLEMENTATION_DEFINED
+};
+
+enum BrigSegCvtModifierMask {
+    BRIG_SEG_CVT_NONULL = 1
+};
+
+enum BrigSegment {
+
+    BRIG_SEGMENT_NONE = 0,
+    BRIG_SEGMENT_FLAT = 1,
+    BRIG_SEGMENT_GLOBAL = 2,
+    BRIG_SEGMENT_READONLY = 3,
+    BRIG_SEGMENT_KERNARG = 4,
+    BRIG_SEGMENT_GROUP = 5,
+    BRIG_SEGMENT_PRIVATE = 6,
+    BRIG_SEGMENT_SPILL = 7,
+    BRIG_SEGMENT_ARG = 8,
+
+    BRIG_SEGMENT_FIRST_USER_DEFINED = 128,
+
+    BRIG_SEGMENT_AMD_GCN = 9
+};
+
+enum BrigPackedTypeBits {
+
+    BRIG_TYPE_BASE_SIZE  = 5,
+    BRIG_TYPE_PACK_SIZE  = 2,
+    BRIG_TYPE_ARRAY_SIZE = 1,
+
+    BRIG_TYPE_BASE_SHIFT  = 0,
+    BRIG_TYPE_PACK_SHIFT  = BRIG_TYPE_BASE_SHIFT + BRIG_TYPE_BASE_SIZE,
+    BRIG_TYPE_ARRAY_SHIFT = BRIG_TYPE_PACK_SHIFT + BRIG_TYPE_PACK_SIZE,
+
+    BRIG_TYPE_BASE_MASK  = ((1 << BRIG_TYPE_BASE_SIZE)  - 1) << BRIG_TYPE_BASE_SHIFT,
+    BRIG_TYPE_PACK_MASK  = ((1 << BRIG_TYPE_PACK_SIZE)  - 1) << BRIG_TYPE_PACK_SHIFT,
+    BRIG_TYPE_ARRAY_MASK = ((1 << BRIG_TYPE_ARRAY_SIZE) - 1) << BRIG_TYPE_ARRAY_SHIFT,
+
+    BRIG_TYPE_PACK_NONE = 0 << BRIG_TYPE_PACK_SHIFT,
+    BRIG_TYPE_PACK_32   = 1 << BRIG_TYPE_PACK_SHIFT,
+    BRIG_TYPE_PACK_64   = 2 << BRIG_TYPE_PACK_SHIFT,
+    BRIG_TYPE_PACK_128  = 3 << BRIG_TYPE_PACK_SHIFT,
+
+    BRIG_TYPE_ARRAY     = 1 << BRIG_TYPE_ARRAY_SHIFT
+};
+
+enum BrigType {
+
+    BRIG_TYPE_NONE  = 0,
+    BRIG_TYPE_U8    = 1,
+    BRIG_TYPE_U16   = 2,
+    BRIG_TYPE_U32   = 3,
+    BRIG_TYPE_U64   = 4,
+    BRIG_TYPE_S8    = 5,
+    BRIG_TYPE_S16   = 6,
+    BRIG_TYPE_S32   = 7,
+    BRIG_TYPE_S64   = 8,
+    BRIG_TYPE_F16   = 9,
+    BRIG_TYPE_F32   = 10,
+    BRIG_TYPE_F64   = 11,
+    BRIG_TYPE_B1    = 12,
+    BRIG_TYPE_B8    = 13,
+    BRIG_TYPE_B16   = 14,
+    BRIG_TYPE_B32   = 15,
+    BRIG_TYPE_B64   = 16,
+    BRIG_TYPE_B128  = 17,
+    BRIG_TYPE_SAMP  = 18,
+    BRIG_TYPE_ROIMG = 19,
+    BRIG_TYPE_WOIMG = 20,
+    BRIG_TYPE_RWIMG = 21,
+    BRIG_TYPE_SIG32 = 22,
+    BRIG_TYPE_SIG64 = 23,
+
+    BRIG_TYPE_U8X4  = BRIG_TYPE_U8  | BRIG_TYPE_PACK_32,
+    BRIG_TYPE_U8X8  = BRIG_TYPE_U8  | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_U8X16 = BRIG_TYPE_U8  | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_U16X2 = BRIG_TYPE_U16 | BRIG_TYPE_PACK_32,
+    BRIG_TYPE_U16X4 = BRIG_TYPE_U16 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_U16X8 = BRIG_TYPE_U16 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_U32X2 = BRIG_TYPE_U32 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_U32X4 = BRIG_TYPE_U32 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_U64X2 = BRIG_TYPE_U64 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_S8X4  = BRIG_TYPE_S8  | BRIG_TYPE_PACK_32,
+    BRIG_TYPE_S8X8  = BRIG_TYPE_S8  | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_S8X16 = BRIG_TYPE_S8  | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_S16X2 = BRIG_TYPE_S16 | BRIG_TYPE_PACK_32,
+    BRIG_TYPE_S16X4 = BRIG_TYPE_S16 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_S16X8 = BRIG_TYPE_S16 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_S32X2 = BRIG_TYPE_S32 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_S32X4 = BRIG_TYPE_S32 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_S64X2 = BRIG_TYPE_S64 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_F16X2 = BRIG_TYPE_F16 | BRIG_TYPE_PACK_32,
+    BRIG_TYPE_F16X4 = BRIG_TYPE_F16 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_F16X8 = BRIG_TYPE_F16 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_F32X2 = BRIG_TYPE_F32 | BRIG_TYPE_PACK_64,
+    BRIG_TYPE_F32X4 = BRIG_TYPE_F32 | BRIG_TYPE_PACK_128,
+    BRIG_TYPE_F64X2 = BRIG_TYPE_F64 | BRIG_TYPE_PACK_128,
+
+    BRIG_TYPE_U8_ARRAY    = BRIG_TYPE_U8    | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U16_ARRAY   = BRIG_TYPE_U16   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U32_ARRAY   = BRIG_TYPE_U32   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U64_ARRAY   = BRIG_TYPE_U64   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S8_ARRAY    = BRIG_TYPE_S8    | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S16_ARRAY   = BRIG_TYPE_S16   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S32_ARRAY   = BRIG_TYPE_S32   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S64_ARRAY   = BRIG_TYPE_S64   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F16_ARRAY   = BRIG_TYPE_F16   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F32_ARRAY   = BRIG_TYPE_F32   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F64_ARRAY   = BRIG_TYPE_F64   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_B8_ARRAY    = BRIG_TYPE_B8    | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_B16_ARRAY   = BRIG_TYPE_B16   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_B32_ARRAY   = BRIG_TYPE_B32   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_B64_ARRAY   = BRIG_TYPE_B64   | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_B128_ARRAY  = BRIG_TYPE_B128  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_SAMP_ARRAY  = BRIG_TYPE_SAMP  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_ROIMG_ARRAY = BRIG_TYPE_ROIMG | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_WOIMG_ARRAY = BRIG_TYPE_WOIMG | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_RWIMG_ARRAY = BRIG_TYPE_RWIMG | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_SIG32_ARRAY = BRIG_TYPE_SIG32 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_SIG64_ARRAY = BRIG_TYPE_SIG64 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U8X4_ARRAY  = BRIG_TYPE_U8X4  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U8X8_ARRAY  = BRIG_TYPE_U8X8  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U8X16_ARRAY = BRIG_TYPE_U8X16 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U16X2_ARRAY = BRIG_TYPE_U16X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U16X4_ARRAY = BRIG_TYPE_U16X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U16X8_ARRAY = BRIG_TYPE_U16X8 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U32X2_ARRAY = BRIG_TYPE_U32X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U32X4_ARRAY = BRIG_TYPE_U32X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_U64X2_ARRAY = BRIG_TYPE_U64X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S8X4_ARRAY  = BRIG_TYPE_S8X4  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S8X8_ARRAY  = BRIG_TYPE_S8X8  | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S8X16_ARRAY = BRIG_TYPE_S8X16 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S16X2_ARRAY = BRIG_TYPE_S16X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S16X4_ARRAY = BRIG_TYPE_S16X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S16X8_ARRAY = BRIG_TYPE_S16X8 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S32X2_ARRAY = BRIG_TYPE_S32X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S32X4_ARRAY = BRIG_TYPE_S32X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_S64X2_ARRAY = BRIG_TYPE_S64X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F16X2_ARRAY = BRIG_TYPE_F16X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F16X4_ARRAY = BRIG_TYPE_F16X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F16X8_ARRAY = BRIG_TYPE_F16X8 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F32X2_ARRAY = BRIG_TYPE_F32X2 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F32X4_ARRAY = BRIG_TYPE_F32X4 | BRIG_TYPE_ARRAY,
+    BRIG_TYPE_F64X2_ARRAY = BRIG_TYPE_F64X2 | BRIG_TYPE_ARRAY,
+
+    BRIG_TYPE_INVALID = (unsigned) -1
+};
+
+enum BrigVariableModifierMask {
+
+    BRIG_VARIABLE_DEFINITION = 1,
+    BRIG_VARIABLE_CONST = 2
+};
+
+enum BrigWidth {
+
+    BRIG_WIDTH_NONE = 0,
+    BRIG_WIDTH_1 = 1,
+    BRIG_WIDTH_2 = 2,
+    BRIG_WIDTH_4 = 3,
+    BRIG_WIDTH_8 = 4,
+    BRIG_WIDTH_16 = 5,
+    BRIG_WIDTH_32 = 6,
+    BRIG_WIDTH_64 = 7,
+    BRIG_WIDTH_128 = 8,
+    BRIG_WIDTH_256 = 9,
+    BRIG_WIDTH_512 = 10,
+    BRIG_WIDTH_1024 = 11,
+    BRIG_WIDTH_2048 = 12,
+    BRIG_WIDTH_4096 = 13,
+    BRIG_WIDTH_8192 = 14,
+    BRIG_WIDTH_16384 = 15,
+    BRIG_WIDTH_32768 = 16,
+    BRIG_WIDTH_65536 = 17,
+    BRIG_WIDTH_131072 = 18,
+    BRIG_WIDTH_262144 = 19,
+    BRIG_WIDTH_524288 = 20,
+    BRIG_WIDTH_1048576 = 21,
+    BRIG_WIDTH_2097152 = 22,
+    BRIG_WIDTH_4194304 = 23,
+    BRIG_WIDTH_8388608 = 24,
+    BRIG_WIDTH_16777216 = 25,
+    BRIG_WIDTH_33554432 = 26,
+    BRIG_WIDTH_67108864 = 27,
+    BRIG_WIDTH_134217728 = 28,
+    BRIG_WIDTH_268435456 = 29,
+    BRIG_WIDTH_536870912 = 30,
+    BRIG_WIDTH_1073741824 = 31,
+    BRIG_WIDTH_2147483648 = 32,
+    BRIG_WIDTH_WAVESIZE = 33,
+    BRIG_WIDTH_ALL = 34,
+
+    BRIG_WIDTH_LAST
+};
+
+struct BrigUInt64 {
+    uint32_t lo;
+    uint32_t hi;
+
+};
+
+struct BrigAluModifier {
+    BrigAluModifier8_t allBits;
+
+};
+
+struct BrigBase {
+    uint16_t byteCount;
+    BrigKind16_t kind;
+};
+
+struct BrigData {
+
+    uint32_t byteCount;
+    uint8_t bytes[1];
+};
+
+struct BrigExecutableModifier {
+    BrigExecutableModifier8_t allBits;
+
+};
+
+struct BrigMemoryModifier {
+    BrigMemoryModifier8_t allBits;
+
+};
+
+struct BrigSegCvtModifier {
+    BrigSegCvtModifier8_t allBits;
+
+};
+
+struct BrigVariableModifier {
+    BrigVariableModifier8_t allBits;
+
+};
+
+struct BrigDirectiveArgBlockEnd {
+    BrigBase base;
+};
+
+struct BrigDirectiveArgBlockStart {
+    BrigBase base;
+};
+
+struct BrigDirectiveComment {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+};
+
+struct BrigDirectiveControl {
+    BrigBase base;
+    BrigControlDirective16_t control;
+    uint16_t reserved;
+    BrigDataOffsetOperandList32_t operands;
+};
+
+struct BrigDirectiveExecutable {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+    uint16_t outArgCount;
+    uint16_t inArgCount;
+    BrigCodeOffset32_t firstInArg;
+    BrigCodeOffset32_t firstCodeBlockEntry;
+    BrigCodeOffset32_t nextModuleEntry;
+    BrigExecutableModifier modifier;
+    BrigLinkage8_t linkage;
+    uint16_t reserved;
+};
+
+struct BrigDirectiveExtension {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+};
+
+struct BrigDirectiveFbarrier {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+    BrigVariableModifier modifier;
+    BrigLinkage8_t linkage;
+    uint16_t reserved;
+};
+
+struct BrigDirectiveLabel {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+};
+
+struct BrigDirectiveLoc {
+    BrigBase base;
+    BrigDataOffsetString32_t filename;
+    uint32_t line;
+    uint32_t column;
+};
+
+struct BrigDirectiveNone {
+    BrigBase base;
+};
+
+struct BrigDirectivePragma {
+    BrigBase base;
+    BrigDataOffsetOperandList32_t operands;
+};
+
+struct BrigDirectiveVariable {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+    BrigOperandOffset32_t init;
+    BrigType16_t type;
+
+    BrigSegment8_t segment;
+    BrigAlignment8_t align;
+    BrigUInt64 dim;
+    BrigVariableModifier modifier;
+    BrigLinkage8_t linkage;
+    BrigAllocation8_t allocation;
+    uint8_t reserved;
+};
+
+struct BrigDirectiveModule {
+    BrigBase base;
+    BrigDataOffsetString32_t name;
+    BrigVersion32_t hsailMajor;
+    BrigVersion32_t hsailMinor;
+    BrigProfile8_t profile;
+    BrigMachineModel8_t machineModel;
+    BrigRound8_t defaultFloatRound;
+    uint8_t reserved;
+};
+
+struct BrigInstBase {
+    BrigBase base;
+    BrigOpcode16_t opcode;
+    BrigType16_t type;
+    BrigDataOffsetOperandList32_t operands;
+
+};
+
+struct BrigInstAddr {
+    BrigInstBase base;
+    BrigSegment8_t segment;
+    uint8_t reserved[3];
+};
+
+struct BrigInstAtomic {
+    BrigInstBase base;
+    BrigSegment8_t segment;
+    BrigMemoryOrder8_t memoryOrder;
+    BrigMemoryScope8_t memoryScope;
+    BrigAtomicOperation8_t atomicOperation;
+    uint8_t equivClass;
+    uint8_t reserved[3];
+};
+
+struct BrigInstBasic {
+    BrigInstBase base;
+};
+
+struct BrigInstBr {
+    BrigInstBase base;
+    BrigWidth8_t width;
+    uint8_t reserved[3];
+};
+
+struct BrigInstCmp {
+    BrigInstBase base;
+    BrigType16_t sourceType;
+    BrigAluModifier modifier;
+    BrigCompareOperation8_t compare;
+    BrigPack8_t pack;
+    uint8_t reserved[3];
+};
+
+struct BrigInstCvt {
+    BrigInstBase base;
+    BrigType16_t sourceType;
+    BrigAluModifier modifier;
+    BrigRound8_t round;
+};
+
+struct BrigInstImage {
+    BrigInstBase base;
+    BrigType16_t imageType;
+    BrigType16_t coordType;
+    BrigImageGeometry8_t geometry;
+    uint8_t equivClass;
+    uint16_t reserved;
+};
+
+struct BrigInstLane {
+    BrigInstBase base;
+    BrigType16_t sourceType;
+    BrigWidth8_t width;
+    uint8_t reserved;
+};
+
+struct BrigInstMem {
+    BrigInstBase base;
+    BrigSegment8_t segment;
+    BrigAlignment8_t align;
+    uint8_t equivClass;
+    BrigWidth8_t width;
+    BrigMemoryModifier modifier;
+    uint8_t reserved[3];
+};
+
+struct BrigInstMemFence {
+    BrigInstBase base;
+    BrigMemoryOrder8_t memoryOrder;
+    BrigMemoryScope8_t globalSegmentMemoryScope;
+    BrigMemoryScope8_t groupSegmentMemoryScope;
+    BrigMemoryScope8_t imageSegmentMemoryScope;
+};
+
+struct BrigInstMod {
+    BrigInstBase base;
+    BrigAluModifier modifier;
+    BrigRound8_t round;
+    BrigPack8_t pack;
+    uint8_t reserved;
+};
+
+struct BrigInstQueryImage {
+    BrigInstBase base;
+    BrigType16_t imageType;
+    BrigImageGeometry8_t geometry;
+    BrigImageQuery8_t imageQuery;
+};
+
+struct BrigInstQuerySampler {
+    BrigInstBase base;
+    BrigSamplerQuery8_t samplerQuery;
+    uint8_t reserved[3];
+};
+
+struct BrigInstQueue {
+    BrigInstBase base;
+    BrigSegment8_t segment;
+    BrigMemoryOrder8_t memoryOrder;
+    uint16_t reserved;
+};
+
+struct BrigInstSeg {
+    BrigInstBase base;
+    BrigSegment8_t segment;
+    uint8_t reserved[3];
+};
+
+struct BrigInstSegCvt {
+    BrigInstBase base;
+    BrigType16_t sourceType;
+    BrigSegment8_t segment;
+    BrigSegCvtModifier modifier;
+};
+
+struct BrigInstSignal {
+    BrigInstBase base;
+    BrigType16_t signalType;
+    BrigMemoryOrder8_t memoryOrder;
+    BrigAtomicOperation8_t signalOperation;
+};
+
+struct BrigInstSourceType {
+    BrigInstBase base;
+    BrigType16_t sourceType;
+    uint16_t reserved;
+};
+
+struct BrigOperandAddress {
+    BrigBase base;
+    BrigCodeOffset32_t symbol;
+    BrigOperandOffset32_t reg;
+    BrigUInt64 offset;
+};
+
+struct BrigOperandAlign {
+    BrigBase base;
+    BrigAlignment8_t align;
+    uint8_t reserved[3];
+};
+
+struct BrigOperandCodeList {
+    BrigBase base;
+    BrigDataOffsetCodeList32_t elements;
+
+};
+
+struct BrigOperandCodeRef {
+    BrigBase base;
+    BrigCodeOffset32_t ref;
+};
+
+struct BrigOperandConstantBytes {
+    BrigBase base;
+    BrigType16_t type;
+    uint16_t reserved;
+    BrigDataOffsetString32_t bytes;
+};
+
+struct BrigOperandConstantOperandList {
+    BrigBase base;
+    BrigType16_t type;
+    uint16_t reserved;
+    BrigDataOffsetOperandList32_t elements;
+
+};
+
+struct BrigOperandConstantImage {
+    BrigBase base;
+    BrigType16_t type;
+    BrigImageGeometry8_t geometry;
+    BrigImageChannelOrder8_t channelOrder;
+    BrigImageChannelType8_t channelType;
+    uint8_t reserved[3];
+    BrigUInt64 width;
+    BrigUInt64 height;
+    BrigUInt64 depth;
+    BrigUInt64 array;
+};
+
+struct BrigOperandOperandList {
+    BrigBase base;
+    BrigDataOffsetOperandList32_t elements;
+
+};
+
+struct BrigOperandRegister {
+    BrigBase base;
+    BrigRegisterKind16_t regKind;
+    uint16_t regNum;
+};
+
+struct BrigOperandConstantSampler {
+    BrigBase base;
+    BrigType16_t type;
+    BrigSamplerCoordNormalization8_t coord;
+    BrigSamplerFilter8_t filter;
+    BrigSamplerAddressing8_t addressing;
+    uint8_t reserved[3];
+};
+
+struct BrigOperandString {
+    BrigBase base;
+    BrigDataOffsetString32_t string;
+};
+
+struct BrigOperandWavesize {
+    BrigBase base;
+};
+
+enum BrigExceptionsMask {
+    BRIG_EXCEPTIONS_INVALID_OPERATION = 1 << 0,
+    BRIG_EXCEPTIONS_DIVIDE_BY_ZERO = 1 << 1,
+    BRIG_EXCEPTIONS_OVERFLOW = 1 << 2,
+    BRIG_EXCEPTIONS_UNDERFLOW = 1 << 3,
+    BRIG_EXCEPTIONS_INEXACT = 1 << 4,
+
+    BRIG_EXCEPTIONS_FIRST_USER_DEFINED = 1 << 16
+};
+
+struct BrigSectionHeader {
+    uint64_t byteCount;
+    uint32_t headerByteCount;
+    uint32_t nameLength;
+    uint8_t name[1];
+};
+
+#define MODULE_IDENTIFICATION_LENGTH (8)
+
+struct BrigModuleHeader {
+    char identification[MODULE_IDENTIFICATION_LENGTH];
+    BrigVersion32_t brigMajor;
+    BrigVersion32_t brigMinor;
+    uint64_t byteCount;
+    uint8_t hash[64];
+    uint32_t reserved;
+    uint32_t sectionCount;
+    uint64_t sectionIndex;
+};
+
+typedef BrigModuleHeader* BrigModule_t;
+
+#endif /* HSA_BRIG_FORMAT_H */

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 11/12] Majority of the HSA back-end
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (9 preceding siblings ...)
  2015-11-05 22:05 ` [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue) Martin Jambor
@ 2015-11-05 22:06 ` Martin Jambor
  2015-11-05 22:07 ` [hsa 12/12] HSA register allocator Martin Jambor
  2015-11-06 10:13 ` Merge of HSA branch Bernd Schmidt
  12 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:06 UTC (permalink / raw)
  To: GCC Patches; +Cc: Martin Liska

Hi,

the following patch comprises the parts of the HSA back-end that have
been developed by Martin Liska and myself (with some help from our
friends at AMD), as opposed to the register allocator in the next
patch.

A full description of the back-end would result in an email so long
that nobody would read it, so let me just briefly describe the
individual files.

- hsa.h is the header file for GCC-specific HSA data structures and
  functions shared among a number of compilation units.

- hsa.c contains common HSA-related functionality that was too big to
  be in a header file.

- hsa-gen.c contains the HSA generating pass class and functionality
  required to translate GIMPLE into our own internal representation of
  HSAIL.

- hsa-dump.c contains functions capable of dumping HSA stuff.

- hsa-brig.c is where creation of the BRIG format is implemented.

The hunk in toplev.c just calls a function in hsa-brig.c that emits
the created BRIG module at the end of the compilation.
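
The call site itself is tiny; a sketch of the idea (not the verbatim
hunk):

  /* At the end of compile_file, after the regular assembly output,
     emit the BRIG module accumulated for this translation unit.  */
  hsa_output_brig ();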

Thanks,

Martin


2015-11-05  Martin Jambor  <mjambor@suse.cz>
            Martin Liska  <mliska@suse.cz>

	* hsa-brig.c: New file.
	* hsa-dump.c: Likewise.
	* hsa-gen.c: Likewise.
	* hsa.c: Likewise.
	* hsa.h: Likewise.
	* toplev.c (compile_file): Call hsa_output_brig.

diff --git a/gcc/hsa-brig.c b/gcc/hsa-brig.c
new file mode 100644
index 0000000..c51f635
--- /dev/null
+++ b/gcc/hsa-brig.c
@@ -0,0 +1,2247 @@
+/* Producing binary form of HSA BRIG from our internal representation.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Martin Jambor <mjambor@suse.cz> and
+   Martin Liska <mliska@suse.cz>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "vec.h"
+#include "hash-table.h"
+#include "hash-map.h"
+#include "tree.h"
+#include "tree-iterator.h"
+#include "stor-layout.h"
+#include "output.h"
+#include "cfg.h"
+#include "function.h"
+#include "fold-const.h"
+#include "stringpool.h"
+#include "gimple-pretty-print.h"
+#include "diagnostic-core.h"
+#include "cgraph.h"
+#include "dumpfile.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+#define BRIG_ELF_SECTION_NAME ".brig"
+#define BRIG_LABEL_STRING "hsa_brig"
+#define BRIG_SECTION_DATA_NAME    "hsa_data"
+#define BRIG_SECTION_CODE_NAME    "hsa_code"
+#define BRIG_SECTION_OPERAND_NAME "hsa_operand"
+
+#define BRIG_CHUNK_MAX_SIZE (64 * 1024)
+
+/* Chunks of BRIG binary data.  */
+
+struct hsa_brig_data_chunk
+{
+  /* Size of the data already stored into a chunk.  */
+  unsigned size;
+
+  /* Pointer to the data.  */
+  char *data;
+};
+
+/* Structure representing a BRIG section, holding and writing its data.  */
+
+class hsa_brig_section
+{
+public:
+  /* Section name that will be output to the BRIG.  */
+  const char *section_name;
+  /* Size in bytes of all data stored in the section.  */
+  unsigned total_size;
+  /* The size of the header of the section including padding. */
+  unsigned header_byte_count;
+  /* The size of the header of the section without any padding.  */
+  unsigned header_byte_delta;
+
+  /* Buffers of binary data, each containing BRIG_CHUNK_MAX_SIZE bytes.  */
+  vec <struct hsa_brig_data_chunk> chunks;
+
+  /* More convenient access to the last chunk from the vector above. */
+  struct hsa_brig_data_chunk *cur_chunk;
+
+  void allocate_new_chunk ();
+  void init (const char *name);
+  void release ();
+  void output ();
+  unsigned add (const void *data, unsigned len);
+  void round_size_up (int factor);
+  void *get_ptr_by_offset (unsigned int offset);
+};
+
+static struct hsa_brig_section brig_data, brig_code, brig_operand;
+static uint32_t brig_insn_count;
+static bool brig_initialized = false;
+
+/* Mapping between emitted HSA functions and their offset in code segment.  */
+static hash_map<tree, BrigCodeOffset32_t> *function_offsets;
+
+/* Set of emitted function declarations.  */
+static hash_map <tree, BrigDirectiveExecutable *> *emitted_declarations;
+
+/* List of sbr instructions.  */
+static vec <hsa_insn_sbr *> *switch_instructions;
+
+struct function_linkage_pair
+{
+  function_linkage_pair (tree decl, unsigned int off):
+    function_decl (decl), offset (off) {}
+
+  /* Declaration of called function.  */
+  tree function_decl;
+
+  /* Offset in operand section.  */
+  unsigned int offset;
+};
+
+/* Vector of function calls where we need to resolve function offsets.  */
+static auto_vec <function_linkage_pair> function_call_linkage;
+
+/* Add a new chunk, allocate data for it and initialize it.  */
+
+void
+hsa_brig_section::allocate_new_chunk ()
+{
+  struct hsa_brig_data_chunk new_chunk;
+
+  new_chunk.data = XCNEWVEC (char, BRIG_CHUNK_MAX_SIZE);
+  new_chunk.size = 0;
+  cur_chunk = chunks.safe_push (new_chunk);
+}
+
+/* Initialize the brig section.  */
+
+void
+hsa_brig_section::init (const char *name)
+{
+  section_name = name;
+  /* While the following computation is basically wrong, because the intent
+     certainly wasn't to have the first character of name and padding, which
+     are a part of sizeof (BrigSectionHeader), included in the first addend,
+     this is what the disassembler expects.  */
+  total_size = sizeof (BrigSectionHeader) + strlen(section_name);
+  chunks.create (1);
+  allocate_new_chunk ();
+  header_byte_delta = total_size;
+  round_size_up (4);
+  header_byte_count = total_size;
+}
+
+/* Free all data in the section.  */
+
+void
+hsa_brig_section::release ()
+{
+  for (unsigned i = 0; i < chunks.length (); i++)
+    free (chunks[i].data);
+  chunks.release ();
+  cur_chunk = NULL;
+}
+
+/* Write the section to the output file to a section with the name given at
+   initialization.  Switches the output section and does not restore it.  */
+
+void
+hsa_brig_section::output ()
+{
+  struct BrigSectionHeader section_header;
+  char padding[8];
+
+  section_header.byteCount = htole64 (total_size);
+  section_header.headerByteCount = htole32 (header_byte_count);
+  section_header.nameLength = htole32 (strlen(section_name));
+  assemble_string ((const char*) &section_header, 16);
+  assemble_string (section_name, (section_header.nameLength));
+  memset (&padding, 0, sizeof (padding));
+  /* This is also a consequence of the wrong header size computation described
+     in a comment in hsa_brig_section::init.  */
+  assemble_string (padding, 8);
+  for (unsigned i = 0; i < chunks.length (); i++)
+    assemble_string (chunks[i].data, chunks[i].size);
+}
+
+/* Add to the stream LEN bytes of opaque binary DATA.  Return the offset at
+   which it was stored.  */
+
+unsigned
+hsa_brig_section::add (const void *data, unsigned len)
+{
+  unsigned offset = total_size;
+
+  gcc_assert (len <= BRIG_CHUNK_MAX_SIZE);
+  if (cur_chunk->size > (BRIG_CHUNK_MAX_SIZE - len))
+    allocate_new_chunk ();
+
+  memcpy (cur_chunk->data + cur_chunk->size, data, len);
+  cur_chunk->size += len;
+  total_size += len;
+
+  return offset;
+}
+
+/* Add padding to section so that its size is divisible by FACTOR.  */
+
+void
+hsa_brig_section::round_size_up (int factor)
+{
+  unsigned padding, res = total_size % factor;
+
+  if (res == 0)
+    return;
+
+  padding = factor - res;
+  total_size += padding;
+  if (cur_chunk->size > (BRIG_CHUNK_MAX_SIZE - padding))
+    {
+      padding -= BRIG_CHUNK_MAX_SIZE - cur_chunk->size;
+      cur_chunk->size = BRIG_CHUNK_MAX_SIZE;
+      allocate_new_chunk ();
+    }
+
+  cur_chunk->size += padding;
+}
+
+/* Return pointer to data by global OFFSET in the section.  */
+
+void*
+hsa_brig_section::get_ptr_by_offset (unsigned int offset)
+{
+  gcc_assert (offset < total_size);
+
+  offset -= header_byte_delta;
+  unsigned int i;
+
+  for (i = 0; offset >= chunks[i].size; i++)
+    offset -= chunks[i].size;
+
+  return chunks[i].data + offset;
+}
+
+/* BRIG string data hashing.  */
+
+struct brig_string_slot
+{
+  const char *s;
+  char prefix;
+  int len;
+  uint32_t offset;
+};
+
+/* Hash table helpers.  */
+
+struct brig_string_slot_hasher : pointer_hash <brig_string_slot>
+{
+  static inline hashval_t hash (const value_type);
+  static inline bool equal (const value_type, const compare_type);
+  static inline void remove (value_type);
+};
+
+/* Returns a hash code for DS.  Adapted from libiberty's htab_hash_string
+   to support strings that may not end in '\0'.  */
+
+inline hashval_t
+brig_string_slot_hasher::hash (const value_type ds)
+{
+  hashval_t r = ds->len;
+  int i;
+
+  for (i = 0; i < ds->len; i++)
+     r = r * 67 + (unsigned)ds->s[i] - 113;
+  r = r * 67 + (unsigned)ds->prefix - 113;
+  return r;
+}
+
+/* Returns nonzero if DS1 and DS2 are equal.  */
+
+inline bool
+brig_string_slot_hasher::equal (const value_type ds1, const compare_type ds2)
+{
+  if (ds1->len == ds2->len)
+    return ds1->prefix == ds2->prefix && memcmp (ds1->s, ds2->s, ds1->len) == 0;
+
+  return 0;
+}
+
+/* Deallocate memory for DS upon its removal.  */
+
+inline void
+brig_string_slot_hasher::remove (value_type ds)
+{
+  free (const_cast<char*> (ds->s));
+  free (ds);
+}
+
+/* Hash for strings we output in order not to duplicate them needlessly.  */
+
+static hash_table<brig_string_slot_hasher> *brig_string_htab;
+
+/* Emit a null terminated string STR to the data section and return its
+   offset in it.  If PREFIX is non-zero, output it just before STR too.
+   Sanitize the string if SANITIZE option is set to true.  */
+
+static unsigned
+brig_emit_string (const char *str, char prefix = 0, bool sanitize = true)
+{
+  unsigned slen = strlen (str);
+  unsigned offset, len = slen + (prefix ? 1 : 0);
+  uint32_t hdr_len = htole32 (len);
+  brig_string_slot s_slot;
+  brig_string_slot **slot;
+  char *str2;
+
+  /* XXX Sanitize the names without all the strdup.  */
+  str2 = xstrdup (str);
+
+  if (sanitize)
+    hsa_sanitize_name (str2);
+  s_slot.s = str2;
+  s_slot.len = slen;
+  s_slot.prefix = prefix;
+  s_slot.offset = 0;
+
+  slot = brig_string_htab->find_slot (&s_slot, INSERT);
+  if (*slot == NULL)
+    {
+      brig_string_slot *new_slot = XCNEW (brig_string_slot);
+
+      /* In theory we should fill in BrigData but that would mean copying
+         the string to a buffer for no reason, so we just emulate it. */
+      offset = brig_data.add (&hdr_len, sizeof (hdr_len));
+      if (prefix)
+        brig_data.add (&prefix, 1);
+
+      brig_data.add (str2, slen);
+      brig_data.round_size_up (4);
+
+      /* XXX could use the string we just copied into brig_string->cur_chunk */
+      new_slot->s = str2;
+      new_slot->len = slen;
+      new_slot->prefix = prefix;
+      new_slot->offset = offset;
+      *slot = new_slot;
+    }
+  else
+    {
+      offset = (*slot)->offset;
+      free (str2);
+    }
+
+  return offset;
+}
+
+/* Linked list of queued operands.  */
+
+static struct operand_queue
+{
+  /* First and last entries of the chain of queued operands.  */
+  hsa_op_base *first_op, *last_op;
+
+  /* The offset at which the next operand will be enqueued.  */
+  unsigned projected_size;
+
+} op_queue;
+
+/* Unless already initialized, initialize infrastructure to produce BRIG.  */
+
+static void
+brig_init (void)
+{
+  brig_insn_count = 0;
+
+  if (brig_initialized)
+    return;
+
+  brig_string_htab = new hash_table<brig_string_slot_hasher> (37);
+  brig_data.init (BRIG_SECTION_DATA_NAME);
+  brig_code.init (BRIG_SECTION_CODE_NAME);
+  brig_operand.init (BRIG_SECTION_OPERAND_NAME);
+  brig_initialized = true;
+
+  struct BrigDirectiveModule moddir;
+  memset (&moddir, 0, sizeof (moddir));
+  moddir.base.byteCount = htole16 (sizeof (moddir));
+
+  char *modname;
+  if (main_input_filename && *main_input_filename != '\0')
+    {
+      const char *part = strrchr (main_input_filename, '/');
+      if (!part)
+	part = main_input_filename;
+      else
+	part++;
+      asprintf (&modname, "&__hsa_module_%s", part);
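+      /* Strip the file name extension by cutting at the first dot.  */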
+      char *extension = strchr (modname, '.');
+      if (extension)
+	*extension = '\0';
+
+      /* In LTO mode, we have to emit a distinct module name for each
+	 partition.  */
+      if (flag_ltrans)
+	{
+	  part = strrchr (asm_file_name, '/');
+	  if (!part)
+	    part = asm_file_name;
+	  else
+	    part++;
+	  char *modname2;
+	  asprintf (&modname2, "%s_%s", modname, part);
+	  free (modname);
+	  modname = modname2;
+	}
+
+      hsa_sanitize_name (modname);
+      moddir.name = brig_emit_string (modname);
+      free (modname);
+    }
+  else
+    moddir.name = brig_emit_string ("__hsa_module_unnamed", '&');
+  moddir.base.kind = htole16 (BRIG_KIND_DIRECTIVE_MODULE);
+  moddir.hsailMajor = htole32 (BRIG_VERSION_HSAIL_MAJOR);
+  moddir.hsailMinor = htole32 (BRIG_VERSION_HSAIL_MINOR);
+  moddir.profile = (hsa_full_profile_p ()
+		    ? BRIG_PROFILE_FULL : BRIG_PROFILE_BASE);
+  if (hsa_machine_large_p ())
+    moddir.machineModel = BRIG_MACHINE_LARGE;
+  else
+    moddir.machineModel = BRIG_MACHINE_SMALL;
+  moddir.defaultFloatRound = BRIG_ROUND_FLOAT_DEFAULT;
+  brig_code.add (&moddir, sizeof (moddir));
+}
+
+/* Free all BRIG data.  */
+
+static void
+brig_release_data (void)
+{
+  delete brig_string_htab;
+  brig_data.release ();
+  brig_code.release ();
+  brig_operand.release ();
+
+  brig_initialized = false;
+}
+
+/* Enqueue operation OP.  Return the offset at which it will be stored.  */
+
+static unsigned int
+enqueue_op (hsa_op_base *op)
+{
+  unsigned ret;
+
+  if (op->m_brig_op_offset)
+    return op->m_brig_op_offset;
+
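+  /* Assign the operand the offset at which it will later be written by
+     emit_queued_operands and append it to the queue.  */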
+  ret = op_queue.projected_size;
+  op->m_brig_op_offset = op_queue.projected_size;
+
+  if (!op_queue.first_op)
+    op_queue.first_op = op;
+  else
+    op_queue.last_op->m_next = op;
+  op_queue.last_op = op;
+
+  if (is_a <hsa_op_immed *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandConstantBytes);
+  else if (is_a <hsa_op_reg *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandRegister);
+  else if (is_a <hsa_op_address *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandAddress);
+  else if (is_a <hsa_op_code_ref *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandCodeRef);
+  else if (is_a <hsa_op_code_list *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandCodeList);
+  else if (is_a <hsa_op_operand_list *> (op))
+    op_queue.projected_size += sizeof (struct BrigOperandOperandList);
+  else
+    gcc_unreachable ();
+  return ret;
+}
+
+
+/* Emit directive describing a symbol if it has not been emitted already.
+   Return the offset of the directive.  */
+
+static unsigned
+emit_directive_variable (struct hsa_symbol *symbol)
+{
+  struct BrigDirectiveVariable dirvar;
+  unsigned name_offset;
+  static unsigned res_name_offset;
+  char prefix;
+
+  if (symbol->m_directive_offset)
+    return symbol->m_directive_offset;
+
+  memset (&dirvar, 0, sizeof (dirvar));
+  dirvar.base.byteCount = htole16 (sizeof (dirvar));
+  dirvar.base.kind = htole16 (BRIG_KIND_DIRECTIVE_VARIABLE);
+  dirvar.allocation = BRIG_ALLOCATION_AUTOMATIC;
+
+  /* Readonly variables must have agent allocation.  */
+  if (symbol->m_cst_value)
+    dirvar.allocation = BRIG_ALLOCATION_AGENT;
+
+  if (symbol->m_decl && is_global_var (symbol->m_decl))
+    {
+      prefix = '&';
+
+      if (!symbol->m_cst_value)
+	{
+	  dirvar.allocation = BRIG_ALLOCATION_PROGRAM;
+	  if (TREE_CODE (symbol->m_decl) == VAR_DECL)
+	    warning (0, "referring to global symbol %q+D by name from HSA code "
+		     "won't work", symbol->m_decl);
+	}
+    }
+  else if (symbol->m_global_scope_p)
+    prefix = '&';
+  else
+    prefix = '%';
+
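+  /* All result symbols share the same name, so emit its string only once
+     and cache the offset.  */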
+  if (symbol->m_decl && TREE_CODE (symbol->m_decl) == RESULT_DECL)
+    {
+      if (res_name_offset == 0)
+	res_name_offset = brig_emit_string (symbol->m_name, '%');
+      name_offset = res_name_offset;
+    }
+  else if (symbol->m_name)
+    name_offset = brig_emit_string (symbol->m_name, prefix);
+  else
+    {
+      char buf[64];
+      sprintf (buf, "__%s_%i", hsa_seg_name (symbol->m_segment),
+	       symbol->m_name_number);
+      name_offset = brig_emit_string (buf, prefix);
+    }
+
+  dirvar.name = htole32 (name_offset);
+  dirvar.init = 0;
+  dirvar.type = htole16 (symbol->m_type);
+  dirvar.segment = symbol->m_segment;
+  /* TODO: Once we are able to access global variables, we must copy their
+     alignment.  */
+  dirvar.align = MAX (hsa_natural_alignment (dirvar.type),
+		      (BrigAlignment8_t) BRIG_ALIGNMENT_4);
+  dirvar.linkage = symbol->m_linkage;
+  dirvar.dim.lo = (uint32_t) symbol->m_dim;
+  dirvar.dim.hi = (uint32_t) ((unsigned long long) symbol->m_dim >> 32);
+  dirvar.modifier.allBits |= BRIG_VARIABLE_DEFINITION;
+  dirvar.reserved = 0;
+
+  if (symbol->m_cst_value)
+    {
+      dirvar.modifier.allBits |= BRIG_VARIABLE_CONST;
+      dirvar.init = htole32 (enqueue_op (symbol->m_cst_value));
+    }
+
+  symbol->m_directive_offset = brig_code.add (&dirvar, sizeof (dirvar));
+  return symbol->m_directive_offset;
+}
+
+/* Emit directives describing either a function declaration or
+   definition F.  */
+
+static BrigDirectiveExecutable *
+emit_function_directives (hsa_function_representation *f, bool is_declaration)
+{
+  struct BrigDirectiveExecutable fndir;
+  unsigned name_offset, inarg_off, scoped_off, next_toplev_off;
+  int count = 0;
+  BrigDirectiveExecutable *ptr_to_fndir;
+  hsa_symbol *sym;
+
+  if (!f->m_declaration_p)
+    for (int i = 0; f->m_readonly_variables.iterate (i, &sym); i++)
+      {
+	emit_directive_variable (sym);
+	brig_insn_count++;
+      }
+
+  name_offset = brig_emit_string (f->m_name, '&');
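+  /* The output argument, the input arguments and the scoped variables follow
+     this directive immediately in the code section, so their offsets can be
+     computed in advance.  */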
+  inarg_off = brig_code.total_size + sizeof (fndir)
+    + (f->m_output_arg ? sizeof (struct BrigDirectiveVariable) : 0);
+  scoped_off = inarg_off
+    + f->m_input_args.length () * sizeof (struct BrigDirectiveVariable);
+
+  if (!f->m_declaration_p)
+    {
+      count += f->m_spill_symbols.length ();
+      count += f->m_private_variables.length ();
+    }
+
+  next_toplev_off = scoped_off + count * sizeof (struct BrigDirectiveVariable);
+
+  memset (&fndir, 0, sizeof (fndir));
+  fndir.base.byteCount = htole16 (sizeof (fndir));
+  fndir.base.kind = htole16 (f->m_kern_p ? BRIG_KIND_DIRECTIVE_KERNEL
+			     : BRIG_KIND_DIRECTIVE_FUNCTION);
+  fndir.name = htole32 (name_offset);
+  fndir.inArgCount = htole16 (f->m_input_args.length ());
+  fndir.outArgCount = htole16 (f->m_output_arg ? 1 : 0);
+  fndir.firstInArg = htole32 (inarg_off);
+  fndir.firstCodeBlockEntry = htole32 (scoped_off);
+  fndir.nextModuleEntry = htole32 (next_toplev_off);
+  fndir.linkage = (f->m_kern_p || TREE_PUBLIC (f->m_decl)
+		   ? BRIG_LINKAGE_PROGRAM : BRIG_LINKAGE_MODULE);
+
+  if (!f->m_declaration_p)
+    fndir.modifier.allBits |= BRIG_EXECUTABLE_DEFINITION;
+  memset (&fndir.reserved, 0, sizeof (fndir.reserved));
+
+  /* Once we have stored the offset of a function definition in
+     function_offsets, do not overwrite it with the offset of a mere
+     declaration.  */
+  if (!function_offsets->get (f->m_decl) || !is_declaration)
+    function_offsets->put (f->m_decl, brig_code.total_size);
+
+  brig_code.add (&fndir, sizeof (fndir));
+  /* XXX terrible hack: we need to set instCount after we emit all
+     insns, but we need to emit directive in order, and we emit directives
+     during insn emitting.  So we need to emit the FUNCTION directive
+     early, then the insns, and then we need to set instCount, so remember
+     a pointer to it, in some horrible way.  cur_chunk.data+size points
+     directly to after fndir here.  */
+  ptr_to_fndir
+      = (BrigDirectiveExecutable *)(brig_code.cur_chunk->data
+                                    + brig_code.cur_chunk->size
+                                    - sizeof (fndir));
+
+  if (f->m_output_arg)
+    emit_directive_variable (f->m_output_arg);
+  for (unsigned i = 0; i < f->m_input_args.length (); i++)
+    emit_directive_variable (f->m_input_args[i]);
+
+  if (!f->m_declaration_p)
+    {
+      for (int i = 0; f->m_spill_symbols.iterate (i, &sym); i++)
+	{
+	  emit_directive_variable (sym);
+	  brig_insn_count++;
+	}
+      for (unsigned i = 0; i < f->m_private_variables.length (); i++)
+	{
+	  emit_directive_variable (f->m_private_variables[i]);
+	  brig_insn_count++;
+	}
+    }
+
+  return ptr_to_fndir;
+}
+
+/* Emit a label directive for the given HBB.  We assume it is about to start on
+   the current offset in the code section.  */
+
+static void
+emit_bb_label_directive (hsa_bb *hbb)
+{
+  struct BrigDirectiveLabel lbldir;
+  char buf[32];
+
+  lbldir.base.byteCount = htole16 (sizeof (lbldir));
+  lbldir.base.kind = htole16 (BRIG_KIND_DIRECTIVE_LABEL);
+  sprintf (buf, "BB_%u_%i", DECL_UID (current_function_decl), hbb->m_index);
+  lbldir.name = htole32 (brig_emit_string (buf, '@'));
+
+  hbb->m_label_ref.m_directive_offset
+    = brig_code.add (&lbldir, sizeof (lbldir));
+  brig_insn_count++;
+}
+
+/* Map a normal HSAIL type to the type of the equivalent BRIG operand
+   holding such, for constants and registers.  */
+
+static BrigType16_t
+regtype_for_type (BrigType16_t t)
+{
+  switch (t)
+    {
+    case BRIG_TYPE_B1:
+      return BRIG_TYPE_B1;
+
+    case BRIG_TYPE_U8:
+    case BRIG_TYPE_U16:
+    case BRIG_TYPE_U32:
+    case BRIG_TYPE_S8:
+    case BRIG_TYPE_S16:
+    case BRIG_TYPE_S32:
+    case BRIG_TYPE_B8:
+    case BRIG_TYPE_B16:
+    case BRIG_TYPE_B32:
+    case BRIG_TYPE_F16:
+    case BRIG_TYPE_F32:
+    case BRIG_TYPE_U8X4:
+    case BRIG_TYPE_U16X2:
+    case BRIG_TYPE_S8X4:
+    case BRIG_TYPE_S16X2:
+    case BRIG_TYPE_F16X2:
+      return BRIG_TYPE_B32;
+
+    case BRIG_TYPE_U64:
+    case BRIG_TYPE_S64:
+    case BRIG_TYPE_F64:
+    case BRIG_TYPE_B64:
+    case BRIG_TYPE_U8X8:
+    case BRIG_TYPE_U16X4:
+    case BRIG_TYPE_U32X2:
+    case BRIG_TYPE_S8X8:
+    case BRIG_TYPE_S16X4:
+    case BRIG_TYPE_S32X2:
+    case BRIG_TYPE_F16X4:
+    case BRIG_TYPE_F32X2:
+      return BRIG_TYPE_B64;
+
+    case BRIG_TYPE_B128:
+    case BRIG_TYPE_U8X16:
+    case BRIG_TYPE_U16X8:
+    case BRIG_TYPE_U32X4:
+    case BRIG_TYPE_U64X2:
+    case BRIG_TYPE_S8X16:
+    case BRIG_TYPE_S16X8:
+    case BRIG_TYPE_S32X4:
+    case BRIG_TYPE_S64X2:
+    case BRIG_TYPE_F16X8:
+    case BRIG_TYPE_F32X4:
+    case BRIG_TYPE_F64X2:
+      return BRIG_TYPE_B128;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Return the length of the BRIG type TYPE that is going to be streamed out as
+   an immediate constant (so it must not be B1).  */
+
+unsigned
+hsa_get_imm_brig_type_len (BrigType16_t type)
+{
+  BrigType16_t base_type = type & BRIG_TYPE_BASE_MASK;
+  BrigType16_t pack_type = type & BRIG_TYPE_PACK_MASK;
+
+  switch (pack_type)
+    {
+    case BRIG_TYPE_PACK_NONE:
+      break;
+    case BRIG_TYPE_PACK_32:
+      return 4;
+    case BRIG_TYPE_PACK_64:
+      return 8;
+    case BRIG_TYPE_PACK_128:
+      return 16;
+    default:
+      gcc_unreachable ();
+    }
+
+  switch (base_type)
+    {
+    case BRIG_TYPE_U8:
+    case BRIG_TYPE_S8:
+    case BRIG_TYPE_B8:
+      return 1;
+    case BRIG_TYPE_U16:
+    case BRIG_TYPE_S16:
+    case BRIG_TYPE_F16:
+    case BRIG_TYPE_B16:
+      return 2;
+    case BRIG_TYPE_U32:
+    case BRIG_TYPE_S32:
+    case BRIG_TYPE_F32:
+    case BRIG_TYPE_B32:
+      return 4;
+    case BRIG_TYPE_U64:
+    case BRIG_TYPE_S64:
+    case BRIG_TYPE_F64:
+    case BRIG_TYPE_B64:
+      return 8;
+    case BRIG_TYPE_B128:
+      return 16;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Emit one scalar VALUE to the buffer DATA intended for BRIG emission.
+   If NEED_LEN is not equal to zero, shrink or extend the value
+   to NEED_LEN bytes.  Return how many bytes were written.  */
+
+static int
+emit_immediate_scalar_to_buffer (tree value, char *data, unsigned need_len)
+{
+  union hsa_bytes bytes;
+
+  memset (&bytes, 0, sizeof (bytes));
+  tree type = TREE_TYPE (value);
+  gcc_checking_assert (TREE_CODE (type) != VECTOR_TYPE);
+
+  unsigned data_len = tree_to_uhwi (TYPE_SIZE (type)) / BITS_PER_UNIT;
+  if (INTEGRAL_TYPE_P (type)
+      || (POINTER_TYPE_P (type) && TREE_CODE (value) == INTEGER_CST))
+    switch (data_len)
+      {
+      case 1:
+	bytes.b8 = (uint8_t) TREE_INT_CST_LOW (value);
+	break;
+      case 2:
+	bytes.b16 = (uint16_t) TREE_INT_CST_LOW (value);
+	break;
+      case 4:
+	bytes.b32 = (uint32_t) TREE_INT_CST_LOW (value);
+	break;
+      case 8:
+	bytes.b64 = (uint64_t) int_cst_value (value);
+	break;
+      default:
+	gcc_unreachable ();
+      }
+  else if (SCALAR_FLOAT_TYPE_P (type))
+    {
+      if (data_len == 2)
+	{
+	  sorry ("Support for HSA does not implement immediate 16 bit FPU "
+		 "operands");
+	  return 2;
+	}
+      unsigned int_len = GET_MODE_SIZE (TYPE_MODE (type));
+      /* real_to_target always stores 32 bits of the representation in each
+	 long, no matter the size of the host's long.  */
+      long tmp[6];
+
+      real_to_target (tmp, TREE_REAL_CST_PTR (value), TYPE_MODE (type));
+
+      if (int_len == 4)
+	bytes.b32 = (uint32_t) tmp[0];
+      else
+	{
+	  bytes.b64 = (uint64_t)(uint32_t) tmp[1];
+	  bytes.b64 <<= 32;
+	  bytes.b64 |= (uint32_t) tmp[0];
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  int len;
+  if (need_len == 0)
+    len = data_len;
+  else
+    len = need_len;
+
+  memcpy (data, &bytes, len);
+  return len;
+}
+
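+/* Fill the m_brig_repr buffer of this immediate operand with the binary
+   representation of the constant VALUE.  */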
+void
+hsa_op_immed::emit_to_buffer (tree value)
+{
+  unsigned total_len = m_brig_repr_size;
+
+  /* A constructor can have fewer elements than the constructed type, so
+     pre-fill the whole buffer with zeros.  */
+  m_brig_repr = XCNEWVEC (char, total_len);
+  char *p = m_brig_repr;
+
+  if (TREE_CODE (value) == VECTOR_CST)
+    {
+      int i, num = VECTOR_CST_NELTS (value);
+      for (i = 0; i < num; i++)
+	{
+	  unsigned actual;
+	  actual = emit_immediate_scalar_to_buffer
+	    (VECTOR_CST_ELT (value, i), p, 0);
+	  total_len -= actual;
+	  p += actual;
+	}
+      /* Vector constants must fill the buffer exactly.  */
+      gcc_assert (total_len == 0);
+    }
+  else if (TREE_CODE (value) == STRING_CST)
+    memcpy (m_brig_repr, TREE_STRING_POINTER (value),
+	    TREE_STRING_LENGTH (value));
+  else if (TREE_CODE (value) == COMPLEX_CST)
+    {
+      gcc_assert (total_len % 2 == 0);
+      unsigned actual;
+      actual = emit_immediate_scalar_to_buffer
+	(TREE_REALPART (value), p, total_len / 2);
+
+      gcc_assert (actual == total_len / 2);
+      p += actual;
+
+      actual = emit_immediate_scalar_to_buffer
+	(TREE_IMAGPART (value), p, total_len / 2);
+      gcc_assert (actual == total_len / 2);
+    }
+  else if (TREE_CODE (value) == CONSTRUCTOR)
+    {
+      unsigned len = vec_safe_length (CONSTRUCTOR_ELTS (value));
+      for (unsigned i = 0; i < len; i++)
+	{
+	  unsigned actual = emit_immediate_scalar_to_buffer
+	    (CONSTRUCTOR_ELT (value, i)->value, p, 0);
+	  total_len -= actual;
+	  p += actual;
+	}
+    }
+  else
+    emit_immediate_scalar_to_buffer (value, p, total_len);
+}
+
+/* Emit an immediate BRIG operand IMM.  The BRIG type of the immediate might
+   have been massaged to comply with various HSA/BRIG type requirements, so
+   the only aspect of it that matters here is its length (HSAIL might expect
+   smaller constants or bit-data).  The data itself is represented according
+   to what is in the tree representation.  */
+
+static void
+emit_immediate_operand (hsa_op_immed *imm)
+{
+  struct BrigOperandConstantBytes out;
+
+  memset (&out, 0, sizeof (out));
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_CONSTANT_BYTES);
+  uint32_t byteCount = htole32 (imm->m_brig_repr_size);
+  out.type = htole16 (imm->m_type);
+  out.bytes = htole32 (brig_data.add (&byteCount, sizeof (byteCount)));
+  brig_operand.add (&out, sizeof (out));
+  brig_data.add (imm->m_brig_repr, imm->m_brig_repr_size);
+  brig_data.round_size_up (4);
+}
+
+/* Emit a register BRIG operand REG.  */
+
+static void
+emit_register_operand (hsa_op_reg *reg)
+{
+  struct BrigOperandRegister out;
+
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_REGISTER);
+  out.regNum = htole32 (reg->m_hard_num);
+
+  switch (regtype_for_type (reg->m_type))
+    {
+    case BRIG_TYPE_B32:
+      out.regKind = BRIG_REGISTER_KIND_SINGLE;
+      break;
+    case BRIG_TYPE_B64:
+      out.regKind = BRIG_REGISTER_KIND_DOUBLE;
+      break;
+    case BRIG_TYPE_B128:
+      out.regKind = BRIG_REGISTER_KIND_QUAD;
+      break;
+    case BRIG_TYPE_B1:
+      out.regKind = BRIG_REGISTER_KIND_CONTROL;
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  brig_operand.add (&out, sizeof (out));
+}
+
+/* Emit an address BRIG operand ADDR.  */
+
+static void
+emit_address_operand (hsa_op_address *addr)
+{
+  struct BrigOperandAddress out;
+
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_ADDRESS);
+  out.symbol = addr->m_symbol
+    ? htole32 (emit_directive_variable (addr->m_symbol)) : 0;
+  out.reg = addr->m_reg ? htole32 (enqueue_op (addr->m_reg)) : 0;
+
+  if (sizeof (addr->m_imm_offset) == 8)
+    {
+      out.offset.lo = htole32 ((uint32_t)addr->m_imm_offset);
+      out.offset.hi = htole32 (((long long) addr->m_imm_offset) >> 32);
+    }
+  else
+    {
+      gcc_assert (sizeof (addr->m_imm_offset) == 4);
+      out.offset.lo = htole32 (addr->m_imm_offset);
+      out.offset.hi = 0;
+    }
+
+  brig_operand.add (&out, sizeof (out));
+}
+
+/* Emit a code reference operand REF.  */
+
+static void
+emit_code_ref_operand (hsa_op_code_ref *ref)
+{
+  struct BrigOperandCodeRef out;
+
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_CODE_REF);
+  out.ref = htole32 (ref->m_directive_offset);
+  brig_operand.add (&out, sizeof (out));
+}
+
+/* Emit a code list operand CODE_LIST.  */
+
+static void
+emit_code_list_operand (hsa_op_code_list *code_list)
+{
+  struct BrigOperandCodeList out;
+  unsigned args = code_list->m_offsets.length ();
+
+  for (unsigned i = 0; i < args; i++)
+    gcc_assert (code_list->m_offsets[i]);
+
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_CODE_LIST);
+
+  uint32_t byteCount = htole32 (4 * args);
+
+  out.elements = htole32 (brig_data.add (&byteCount, sizeof (byteCount)));
+  brig_data.add (code_list->m_offsets.address (), args * sizeof (uint32_t));
+  brig_data.round_size_up (4);
+  brig_operand.add (&out, sizeof (out));
+}
+
+/* Emit an operand list operand OPERAND_LIST.  */
+
+static void
+emit_operand_list_operand (hsa_op_operand_list *operand_list)
+{
+  struct BrigOperandOperandList out;
+  unsigned args = operand_list->m_offsets.length ();
+
+  for (unsigned i = 0; i < args; i++)
+    gcc_assert (operand_list->m_offsets[i]);
+
+  out.base.byteCount = htole16 (sizeof (out));
+  out.base.kind = htole16 (BRIG_KIND_OPERAND_OPERAND_LIST);
+
+  uint32_t byteCount = htole32 (4 * args);
+
+  out.elements = htole32 (brig_data.add (&byteCount, sizeof (byteCount)));
+  brig_data.add (operand_list->m_offsets.address (), args * sizeof (uint32_t));
+  brig_data.round_size_up (4);
+  brig_operand.add (&out, sizeof (out));
+}
+
+/* Emit all operands queued for writing.  */
+
+static void
+emit_queued_operands (void)
+{
+  for (hsa_op_base *op = op_queue.first_op; op; op = op->m_next)
+    {
+      gcc_assert (op->m_brig_op_offset == brig_operand.total_size);
+      if (hsa_op_immed *imm = dyn_cast <hsa_op_immed *> (op))
+	emit_immediate_operand (imm);
+      else if (hsa_op_reg *reg = dyn_cast <hsa_op_reg *> (op))
+	emit_register_operand (reg);
+      else if (hsa_op_address *addr = dyn_cast <hsa_op_address *> (op))
+	emit_address_operand (addr);
+      else if (hsa_op_code_ref *ref = dyn_cast <hsa_op_code_ref *> (op))
+	emit_code_ref_operand (ref);
+      else if (hsa_op_code_list *code_list = dyn_cast <hsa_op_code_list *> (op))
+	emit_code_list_operand (code_list);
+      else if (hsa_op_operand_list *l = dyn_cast <hsa_op_operand_list *> (op))
+	emit_operand_list_operand (l);
+      else
+	gcc_unreachable ();
+    }
+}
+
+/* Emit directives describing a declaration of the function DECL and return
+   a pointer to the emitted directive.  */
+
+static BrigDirectiveExecutable *
+emit_function_declaration (tree decl)
+{
+  hsa_function_representation *f = hsa_generate_function_declaration (decl);
+
+  BrigDirectiveExecutable *e = emit_function_directives (f, true);
+  emit_queued_operands ();
+
+  delete f;
+
+  return e;
+}
+
+/* Enqueue all operands of INSN and return the offset into the BRIG data
+   section of the resulting list of operand offsets.  */
+
+static unsigned
+emit_insn_operands (hsa_insn_basic *insn)
+{
+  auto_vec<BrigOperandOffset32_t, HSA_BRIG_INT_STORAGE_OPERANDS>
+    operand_offsets;
+
+  unsigned l = insn->operand_count ();
+  operand_offsets.safe_grow (l);
+
+  for (unsigned i = 0; i < l; i++)
+    operand_offsets[i] = htole32 (enqueue_op (insn->get_op (i)));
+
+  /* We have N operands so use 4 * N for the byte_count.  */
+  uint32_t byte_count = htole32 (4 * l);
+
+  unsigned offset = brig_data.add (&byte_count, sizeof (byte_count));
+  brig_data.add (operand_offsets.address (),
+		 l * sizeof (BrigOperandOffset32_t));
+
+  brig_data.round_size_up (4);
+
+  return offset;
+}
+
+/* Enqueue operands OP0, OP1 and OP2 (when not NULL) and return the offset
+   into the BRIG data section of the resulting list of operand offsets.  */
+
+static unsigned
+emit_operands (hsa_op_base *op0, hsa_op_base *op1 = NULL,
+	       hsa_op_base *op2 = NULL)
+{
+  auto_vec<BrigOperandOffset32_t, HSA_BRIG_INT_STORAGE_OPERANDS>
+    operand_offsets;
+
+  gcc_checking_assert (op0 != NULL);
+  operand_offsets.safe_push (enqueue_op (op0));
+
+  if (op1 != NULL)
+    {
+      operand_offsets.safe_push (enqueue_op (op1));
+      if (op2 != NULL)
+	operand_offsets.safe_push (enqueue_op (op2));
+    }
+
+  unsigned l = operand_offsets.length ();
+
+  /* We have N operands so use 4 * N for the byte_count.  */
+  uint32_t byte_count = htole32 (4 * l);
+
+  unsigned offset = brig_data.add (&byte_count, sizeof (byte_count));
+  brig_data.add (operand_offsets.address (),
+		 l * sizeof (BrigOperandOffset32_t));
+
+  brig_data.round_size_up (4);
+
+  return offset;
+}
+
+/* Emit an HSA memory instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_memory_insn (hsa_insn_mem *mem)
+{
+  struct BrigInstMem repr;
+  gcc_checking_assert (mem->operand_count () == 2);
+
+  hsa_op_address *addr = as_a <hsa_op_address *> (mem->get_op (1));
+
+  /* Zeroing the structure is necessary because the erroneous typedef of
+     BrigMemoryModifier8_t introduces padding which may then contain random
+     bytes, and we want reproducible output so that we can test it does not
+     change.  */
+  memset (&repr, 0, sizeof (repr));
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_MEM);
+  repr.base.opcode = htole16 (mem->m_opcode);
+  repr.base.type = htole16 (mem->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (mem));
+
+  if (addr->m_symbol)
+    repr.segment = addr->m_symbol->m_segment;
+  else
+    repr.segment = BRIG_SEGMENT_FLAT;
+  repr.modifier.allBits = 0;
+  repr.equivClass = mem->m_equiv_class;
+  repr.align = mem->m_align;
+  if (mem->m_opcode == BRIG_OPCODE_LD)
+    repr.width = BRIG_WIDTH_1;
+  else
+    repr.width = BRIG_WIDTH_NONE;
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA signal memory instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_signal_insn (hsa_insn_signal *mem)
+{
+  struct BrigInstSignal repr;
+
+  /* Zeroing the structure is necessary because the erroneous typedef of
+     BrigMemoryModifier8_t introduces padding which may then contain random
+     bytes, and we want reproducible output so that we can test it does not
+     change.  */
+  memset (&repr, 0, sizeof (repr));
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_SIGNAL);
+  repr.base.opcode = htole16 (mem->m_opcode);
+  repr.base.type = htole16 (mem->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (mem));
+
+  repr.memoryOrder = mem->m_memoryorder;
+  repr.signalOperation = mem->m_atomicop;
+  repr.signalType = BRIG_TYPE_SIG64;
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA atomic memory instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_atomic_insn (hsa_insn_atomic *mem)
+{
+  struct BrigInstAtomic repr;
+
+  /* Either operand[0] or operand[1] must be an address operand.  */
+  hsa_op_address *addr = NULL;
+  if (is_a <hsa_op_address *> (mem->get_op (0)))
+    addr = as_a <hsa_op_address *> (mem->get_op (0));
+  else
+    addr = as_a <hsa_op_address *> (mem->get_op (1));
+
+  /* Zeroing the structure is necessary because the erroneous typedef of
+     BrigMemoryModifier8_t introduces padding which may then contain random
+     bytes, and we want reproducible output so that we can test it does not
+     change.  */
+  memset (&repr, 0, sizeof (repr));
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_ATOMIC);
+  repr.base.opcode = htole16 (mem->m_opcode);
+  repr.base.type = htole16 (mem->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (mem));
+
+  if (addr->m_symbol)
+    repr.segment = addr->m_symbol->m_segment;
+  else
+    repr.segment = BRIG_SEGMENT_FLAT;
+  repr.memoryOrder = mem->m_memoryorder;
+  repr.memoryScope = mem->m_memoryscope;
+  repr.atomicOperation = mem->m_atomicop;
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA LDA instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_addr_insn (hsa_insn_basic *insn)
+{
+  struct BrigInstAddr repr;
+
+  hsa_op_address *addr = as_a <hsa_op_address *> (insn->get_op (1));
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_ADDR);
+  repr.base.opcode = htole16 (insn->m_opcode);
+  repr.base.type = htole16 (insn->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (insn));
+
+  if (addr->m_symbol)
+    repr.segment = addr->m_symbol->m_segment;
+  else
+    repr.segment = BRIG_SEGMENT_FLAT;
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA segment conversion instruction and all necessary directives,
+   schedule necessary operands for writing.  */
+
+static void
+emit_segment_insn (hsa_insn_seg *seg)
+{
+  struct BrigInstSegCvt repr;
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_SEG_CVT);
+  repr.base.opcode = htole16 (seg->m_opcode);
+  repr.base.type = htole16 (seg->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (seg));
+  repr.sourceType = htole16 (as_a <hsa_op_reg *> (seg->get_op (1))->m_type);
+  repr.segment = seg->m_segment;
+  repr.modifier.allBits = 0;
+
+  brig_code.add (&repr, sizeof (repr));
+
+  brig_insn_count++;
+}
+
+/* Emit an HSA comparison instruction and all necessary directives,
+   schedule necessary operands for writing.  */
+
+static void
+emit_cmp_insn (hsa_insn_cmp *cmp)
+{
+  struct BrigInstCmp repr;
+
+  memset (&repr, 0, sizeof (repr));
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_CMP);
+  repr.base.opcode = htole16 (cmp->m_opcode);
+  repr.base.type = htole16 (cmp->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (cmp));
+
+  if (is_a <hsa_op_reg *> (cmp->get_op (1)))
+    repr.sourceType = htole16 (as_a <hsa_op_reg *> (cmp->get_op (1))->m_type);
+  else
+    repr.sourceType = htole16 (as_a <hsa_op_immed *> (cmp->get_op (1))->m_type);
+  repr.modifier.allBits = 0;
+  repr.compare = cmp->m_compare;
+  repr.pack = 0;
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA branching instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_branch_insn (hsa_insn_br *br)
+{
+  struct BrigInstBr repr;
+
+  basic_block target = NULL;
+  edge_iterator ei;
+  edge e;
+
+  /* At the moment we only handle direct conditional jumps.  */
+  gcc_assert (br->m_opcode == BRIG_OPCODE_CBR);
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_BR);
+  repr.base.opcode = htole16 (br->m_opcode);
+  repr.width = BRIG_WIDTH_1;
+  /* For conditional jumps the type is always B1.  */
+  repr.base.type = htole16 (BRIG_TYPE_B1);
+
+  FOR_EACH_EDGE (e, ei, br->m_bb->succs)
+    if (e->flags & EDGE_TRUE_VALUE)
+      {
+	target = e->dest;
+	break;
+      }
+  gcc_assert (target);
+
+  repr.base.operands = htole32
+    (emit_operands (br->get_op (0), &hsa_bb_for_bb (target)->m_label_ref));
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA unconditional jump branching instruction that points to
+   a label REFERENCE.  */
+
+static void
+emit_unconditional_jump (hsa_op_code_ref *reference)
+{
+  struct BrigInstBr repr;
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_BR);
+  repr.base.opcode = htole16 (BRIG_OPCODE_BR);
+  repr.base.type = htole16 (BRIG_TYPE_NONE);
+  /* Direct branches to labels must be width(all).  */
+  repr.width = BRIG_WIDTH_ALL;
+
+  repr.base.operands = htole32 (emit_operands (reference));
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit an HSA switch jump instruction that uses a jump table to
+   jump to a destination label.  */
+
+static void
+emit_switch_insn (hsa_insn_sbr *sbr)
+{
+  struct BrigInstBr repr;
+
+  gcc_assert (sbr->m_opcode == BRIG_OPCODE_SBR);
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_BR);
+  repr.base.opcode = htole16 (sbr->m_opcode);
+  repr.width = BRIG_WIDTH_1;
+  /* The type of the instruction is the type of the index register.  */
+  hsa_op_reg *index = as_a <hsa_op_reg *> (sbr->get_op (0));
+  repr.base.type = htole16 (index->m_type);
+  repr.base.operands = htole32
+    (emit_operands (sbr->get_op (0), sbr->m_label_code_list));
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+
+  /* Emit jump to default label.  */
+  hsa_bb *hbb = hsa_bb_for_bb (sbr->m_default_bb);
+  emit_unconditional_jump (&hbb->m_label_ref);
+}
+
+/* Emit a HSA convert instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_cvt_insn (hsa_insn_cvt *insn)
+{
+  struct BrigInstCvt repr;
+  BrigType16_t srctype;
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_CVT);
+  repr.base.opcode = htole16 (insn->m_opcode);
+  repr.base.type = htole16 (insn->m_type);
+  repr.base.operands = htole32 (emit_insn_operands (insn));
+
+  if (is_a <hsa_op_reg *> (insn->get_op (1)))
+    srctype = as_a <hsa_op_reg *> (insn->get_op (1))->m_type;
+  else
+    srctype = as_a <hsa_op_immed *> (insn->get_op (1))->m_type;
+  repr.sourceType = htole16 (srctype);
+  repr.modifier.allBits = 0;
+  /* Conversions to a float type from a non-float type or from a larger
+     float type require a rounding setting (we default to 'near').  */
+  if (hsa_type_float_p (insn->m_type)
+      && (!hsa_type_float_p (srctype)
+	  || ((insn->m_type & BRIG_TYPE_BASE_MASK)
+	      < (srctype & BRIG_TYPE_BASE_MASK))))
+    repr.round = BRIG_ROUND_FLOAT_NEAR_EVEN;
+  else if (hsa_type_integer_p (insn->m_type)
+	   && hsa_type_float_p (srctype))
+    repr.round = BRIG_ROUND_INTEGER_ZERO;
+  else
+    repr.round = BRIG_ROUND_NONE;
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit call instruction CALL, which must be enclosed within an argument
+   block.  */
+
+static void
+emit_call_insn (hsa_insn_call *call)
+{
+  struct BrigInstBr repr;
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_BR);
+  repr.base.opcode = htole16 (BRIG_OPCODE_CALL);
+  repr.base.type = htole16 (BRIG_TYPE_NONE);
+
+  repr.base.operands = htole32
+    (emit_operands (call->m_result_code_list, &call->m_func,
+		    call->m_args_code_list));
+
+  function_call_linkage.safe_push
+    (function_linkage_pair (call->m_called_function,
+			    call->m_func.m_brig_op_offset));
+
+  repr.width = BRIG_WIDTH_ALL;
+  memset (&repr.reserved, 0, sizeof (repr.reserved));
+
+  brig_code.add (&repr, sizeof (repr));
+  brig_insn_count++;
+}
+
+/* Emit argument block directive.  */
+
+static void
+emit_arg_block_insn (hsa_insn_arg_block *insn)
+{
+  switch (insn->m_kind)
+    {
+    case BRIG_KIND_DIRECTIVE_ARG_BLOCK_START:
+      {
+	struct BrigDirectiveArgBlockStart repr;
+	repr.base.byteCount = htole16 (sizeof (repr));
+	repr.base.kind = htole16 (insn->m_kind);
+	brig_code.add (&repr, sizeof (repr));
+
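+	/* Emit directives for all input arguments of the call and store their
+	   offsets into the argument code list of the call instruction.  */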
+	for (unsigned i = 0; i < insn->m_call_insn->m_input_args.length (); i++)
+	  {
+	    insn->m_call_insn->m_args_code_list->m_offsets[i] = htole32
+	      (emit_directive_variable (insn->m_call_insn->m_input_args[i]));
+	    brig_insn_count++;
+	  }
+
+	if (insn->m_call_insn->m_output_arg)
+	  {
+	    insn->m_call_insn->m_result_code_list->m_offsets[0] = htole32
+	      (emit_directive_variable (insn->m_call_insn->m_output_arg));
+	    brig_insn_count++;
+	  }
+
+	break;
+      }
+    case BRIG_KIND_DIRECTIVE_ARG_BLOCK_END:
+      {
+	struct BrigDirectiveArgBlockEnd repr;
+	repr.base.byteCount = htole16 (sizeof (repr));
+	repr.base.kind = htole16 (insn->m_kind);
+	brig_code.add (&repr, sizeof (repr));
+	break;
+      }
+    default:
+      gcc_unreachable ();
+    }
+
+  brig_insn_count++;
+}
+
+/* Emit comment directive.  */
+
+static void
+emit_comment_insn (hsa_insn_comment *insn)
+{
+  struct BrigDirectiveComment repr;
+  memset (&repr, 0, sizeof (repr));
+
+  repr.base.byteCount = htole16 (sizeof (repr));
+  repr.base.kind = htole16 (insn->m_opcode);
+  repr.name = brig_emit_string (insn->m_comment, '\0', false);
+  brig_code.add (&repr, sizeof (repr));
+}
+
+/* Emit queue instruction INSN.  */
+
+static void
+emit_queue_insn (hsa_insn_queue *insn)
+{
+  BrigInstQueue repr;
+  memset (&repr, 0, sizeof (repr));
+
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_QUEUE);
+  repr.base.opcode = htole16 (insn->m_opcode);
+  repr.base.type = htole16 (insn->m_type);
+  repr.segment = BRIG_SEGMENT_GLOBAL;
+  repr.memoryOrder = BRIG_MEMORY_ORDER_SC_RELEASE;
+  repr.base.operands = htole32 (emit_insn_operands (insn));
+  brig_data.round_size_up (4);
+  brig_code.add (&repr, sizeof (repr));
+
+  brig_insn_count++;
+}
+
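+/* Emit an HSA packed instruction (a combine or an expand) and all necessary
+   directives, schedule necessary operands for writing.  */
+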
+static void
+emit_packed_insn (hsa_insn_packed *insn)
+{
+  struct BrigInstSourceType repr;
+  unsigned operand_count = insn->operand_count ();
+  gcc_checking_assert (operand_count >= 2);
+
+  memset (&repr, 0, sizeof (repr));
+  repr.sourceType = htole16 (insn->m_source_type);
+  repr.base.base.byteCount = htole16 (sizeof (repr));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_SOURCE_TYPE);
+  repr.base.opcode = htole16 (insn->m_opcode);
+  repr.base.type = htole16 (insn->m_type);
+
+  if (insn->m_opcode == BRIG_OPCODE_COMBINE)
+    {
+      /* Create operand list for packed type.  */
+      for (unsigned i = 1; i < operand_count; i++)
+	{
+	  gcc_checking_assert (insn->get_op (i));
+	  insn->m_operand_list->m_offsets[i - 1] = htole32
+	    (enqueue_op (insn->get_op (i)));
+	}
+
+      repr.base.operands = htole32 (emit_operands (insn->get_op (0),
+						   insn->m_operand_list));
+    }
+  else if (insn->m_opcode == BRIG_OPCODE_EXPAND)
+    {
+      /* Create operand list for packed type.  */
+      for (unsigned i = 0; i < operand_count - 1; i++)
+	{
+	  gcc_checking_assert (insn->get_op (i));
+	  insn->m_operand_list->m_offsets[i] = htole32
+	    (enqueue_op (insn->get_op (i)));
+	}
+
+      repr.base.operands = htole32
+	(emit_operands (insn->get_op (insn->operand_count () - 1),
+			insn->m_operand_list));
+    }
+
+  brig_code.add (&repr, sizeof (struct BrigInstSourceType));
+  brig_insn_count++;
+}
+
+/* Emit a basic HSA instruction and all necessary directives, schedule
+   necessary operands for writing.  */
+
+static void
+emit_basic_insn (hsa_insn_basic *insn)
+{
+  /* We assume that BrigInstMod has a BrigInstBasic prefix.  */
+  struct BrigInstMod repr;
+  BrigType16_t type;
+
+  memset (&repr, 0, sizeof (repr));
+  repr.base.base.byteCount = htole16 (sizeof (BrigInstBasic));
+  repr.base.base.kind = htole16 (BRIG_KIND_INST_BASIC);
+  repr.base.opcode = htole16 (insn->m_opcode);
+  switch (insn->m_opcode)
+    {
+      /* The bit-logical operations need bit types and reject arithmetic
+         types, so use the equivalent bit type for them.  */
+      case BRIG_OPCODE_AND:
+      case BRIG_OPCODE_OR:
+      case BRIG_OPCODE_XOR:
+      case BRIG_OPCODE_NOT:
+	type = regtype_for_type (insn->m_type);
+	break;
+      default:
+	type = insn->m_type;
+	break;
+    }
+  repr.base.type = htole16 (type);
+  repr.base.operands = htole32 (emit_insn_operands (insn));
+
+  if ((type & BRIG_TYPE_PACK_MASK) != BRIG_TYPE_PACK_NONE)
+    {
+      if (hsa_type_float_p (type)
+	  && !hsa_opcode_floating_bit_insn_p (insn->m_opcode))
+	repr.round = BRIG_ROUND_FLOAT_NEAR_EVEN;
+      else
+	repr.round = 0;
+      /* We assume that destination and sources agree in packing
+         layout.  */
+      if (insn->num_used_ops () >= 2)
+	repr.pack = BRIG_PACK_PP;
+      else
+	repr.pack = BRIG_PACK_P;
+      repr.reserved = 0;
+      repr.base.base.byteCount = htole16 (sizeof (BrigInstMod));
+      repr.base.base.kind = htole16 (BRIG_KIND_INST_MOD);
+      brig_code.add (&repr, sizeof (struct BrigInstMod));
+    }
+  else
+    brig_code.add (&repr, sizeof (struct BrigInstBasic));
+  brig_insn_count++;
+}
+
+/* Emit an HSA instruction and all necessary directives, schedule necessary
+   operands for writing.  */
+
+static void
+emit_insn (hsa_insn_basic *insn)
+{
+  gcc_assert (!is_a <hsa_insn_phi *> (insn));
+
+  insn->m_brig_offset = brig_code.total_size;
+
+  if (hsa_insn_signal *signal = dyn_cast <hsa_insn_signal *> (insn))
+    emit_signal_insn (signal);
+  else if (hsa_insn_atomic *atom = dyn_cast <hsa_insn_atomic *> (insn))
+    emit_atomic_insn (atom);
+  else if (hsa_insn_mem *mem = dyn_cast <hsa_insn_mem *> (insn))
+    emit_memory_insn (mem);
+  else if (insn->m_opcode == BRIG_OPCODE_LDA)
+    emit_addr_insn (insn);
+  else if (hsa_insn_seg *seg = dyn_cast <hsa_insn_seg *> (insn))
+    emit_segment_insn (seg);
+  else if (hsa_insn_cmp *cmp = dyn_cast <hsa_insn_cmp *> (insn))
+    emit_cmp_insn (cmp);
+  else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
+    emit_branch_insn (br);
+  else if (hsa_insn_sbr *sbr = dyn_cast <hsa_insn_sbr *> (insn))
+    {
+      if (switch_instructions == NULL)
+	switch_instructions = new vec <hsa_insn_sbr *> ();
+
+      switch_instructions->safe_push (sbr);
+      emit_switch_insn (sbr);
+    }
+  else if (hsa_insn_arg_block *block = dyn_cast <hsa_insn_arg_block *> (insn))
+    emit_arg_block_insn (block);
+  else if (hsa_insn_call *call = dyn_cast <hsa_insn_call *> (insn))
+    emit_call_insn (call);
+  else if (hsa_insn_comment *comment = dyn_cast <hsa_insn_comment *> (insn))
+    emit_comment_insn (comment);
+  else if (hsa_insn_queue *queue = dyn_cast <hsa_insn_queue *> (insn))
+    emit_queue_insn (queue);
+  else if (hsa_insn_packed *packed = dyn_cast <hsa_insn_packed *> (insn))
+    emit_packed_insn (packed);
+  else if (hsa_insn_cvt *cvt = dyn_cast <hsa_insn_cvt *> (insn))
+    emit_cvt_insn (cvt);
+  else
+    emit_basic_insn (insn);
+}
+
+/* We have just finished emitting BB and are about to emit NEXT_BB if non-NULL,
+   or we are about to finish emitting code, if it is NULL.  If the fall-through
+   edge from BB does not lead to NEXT_BB, emit an unconditional jump.  */
+
+static void
+perhaps_emit_branch (basic_block bb, basic_block next_bb)
+{
+  basic_block t_bb = NULL, ff = NULL;
+
+  edge_iterator ei;
+  edge e;
+
+  /* If the last instruction of BB is a switch, ignore emission of all
+     edges.  */
+  if (hsa_bb_for_bb (bb)->m_last_insn
+      && is_a <hsa_insn_sbr *> (hsa_bb_for_bb (bb)->m_last_insn))
+    return;
+
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    if (e->flags & EDGE_TRUE_VALUE)
+      {
+	gcc_assert (!t_bb);
+	t_bb = e->dest;
+      }
+    else
+      {
+	gcc_assert (!ff);
+	ff = e->dest;
+      }
+
+  if (!ff || ff == next_bb || ff == EXIT_BLOCK_PTR_FOR_FN (cfun))
+    return;
+
+  emit_unconditional_jump (&hsa_bb_for_bb (ff)->m_label_ref);
+}
+
+/* Emit the function represented by HSA_CFUN to the various BRIG sections.  */
+
+void
+hsa_brig_emit_function (void)
+{
+  basic_block bb, prev_bb;
+  hsa_insn_basic *insn;
+  BrigDirectiveExecutable *ptr_to_fndir;
+
+  brig_init ();
+
+  brig_insn_count = 0;
+  memset (&op_queue, 0, sizeof (op_queue));
+  op_queue.projected_size = brig_operand.total_size;
+
+  if (!function_offsets)
+    function_offsets = new hash_map<tree, BrigCodeOffset32_t> ();
+
+  if (!emitted_declarations)
+    emitted_declarations = new hash_map <tree, BrigDirectiveExecutable *> ();
+
+  for (unsigned i = 0; i < hsa_cfun->m_called_functions.length (); i++)
+    {
+      tree called = hsa_cfun->m_called_functions[i];
+
+      /* If the function has no definition, emit a declaration.  */
+      if (!emitted_declarations->get (called))
+	{
+	  BrigDirectiveExecutable *e = emit_function_declaration (called);
+	  emitted_declarations->put (called, e);
+	}
+    }
+
+  ptr_to_fndir = emit_function_directives (hsa_cfun, false);
+  for (insn = hsa_bb_for_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun))->m_first_insn;
+       insn;
+       insn = insn->m_next)
+    emit_insn (insn);
+  prev_bb = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      perhaps_emit_branch (prev_bb, bb);
+      emit_bb_label_directive (hsa_bb_for_bb (bb));
+      for (insn = hsa_bb_for_bb (bb)->m_first_insn; insn; insn = insn->m_next)
+	emit_insn (insn);
+      prev_bb = bb;
+    }
+  perhaps_emit_branch (prev_bb, NULL);
+  ptr_to_fndir->nextModuleEntry = brig_code.total_size;
+
+  /* Fill up label references for all sbr instructions.  */
+  if (switch_instructions)
+    {
+      for (unsigned i = 0; i < switch_instructions->length (); i++)
+	{
+	  hsa_insn_sbr *sbr = (*switch_instructions)[i];
+	  for (unsigned j = 0; j < sbr->m_jump_table.length (); j++)
+	    {
+	      hsa_bb *hbb = hsa_bb_for_bb (sbr->m_jump_table[j]);
+	      sbr->m_label_code_list->m_offsets[j]
+		= hbb->m_label_ref.m_directive_offset;
+	    }
+	}
+
+      switch_instructions->release ();
+      delete switch_instructions;
+      switch_instructions = NULL;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "------- After BRIG emission: -------\n");
+      dump_hsa_cfun (dump_file);
+    }
+
+  emit_queued_operands ();
+}
+
+/* Emit all symbols related to OMP support.  */
+
+void
+hsa_brig_emit_omp_symbols (void)
+{
+  brig_init ();
+  emit_directive_variable (hsa_num_threads);
+}
+
+/* Unit constructor and destructor statements.  */
+
+static GTY(()) tree hsa_ctor_statements;
+static GTY(()) tree hsa_dtor_statements;
+
+/* Create a static constructor that will register our BRIG module and its
+   kernels with libgomp.  */
+
+static void
+hsa_output_kernel_mapping (tree brig_decl)
+{
+  unsigned map_count = hsa_get_number_decl_kernel_mappings ();
+
+  tree int_num_of_kernels;
+  int_num_of_kernels = build_int_cst (uint32_type_node, map_count);
+  tree kernel_num_index_type = build_index_type (int_num_of_kernels);
+  tree host_functions_array_type = build_array_type (ptr_type_node,
+						     kernel_num_index_type);
+
+  vec<constructor_elt, va_gc> *host_functions_vec = NULL;
+  for (unsigned i = 0; i < map_count; ++i)
+    {
+      tree decl = hsa_get_decl_kernel_mapping_decl (i);
+      CONSTRUCTOR_APPEND_ELT
+	(host_functions_vec, NULL_TREE,
+	 build_fold_addr_expr (hsa_get_host_function (decl)));
+    }
+  tree host_functions_ctor = build_constructor (host_functions_array_type,
+						host_functions_vec);
+  char tmp_name[64];
+  ASM_GENERATE_INTERNAL_LABEL (tmp_name, "__hsa_host_functions", 1);
+  tree hsa_host_func_table = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+					 get_identifier (tmp_name),
+					 host_functions_array_type);
+  TREE_STATIC (hsa_host_func_table) = 1;
+  TREE_READONLY (hsa_host_func_table) = 1;
+  TREE_PUBLIC (hsa_host_func_table) = 0;
+  DECL_ARTIFICIAL (hsa_host_func_table) = 1;
+  DECL_IGNORED_P (hsa_host_func_table) = 1;
+  DECL_EXTERNAL (hsa_host_func_table) = 0;
+  TREE_CONSTANT (hsa_host_func_table) = 1;
+  DECL_INITIAL (hsa_host_func_table) = host_functions_ctor;
+  varpool_node::finalize_decl (hsa_host_func_table);
+
+  /* The following code emits a list of kernel_info structures.  */
+
+  tree kernel_info_type = make_node (RECORD_TYPE);
+  tree id_f1 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("name"), ptr_type_node);
+  DECL_CHAIN (id_f1) = NULL_TREE;
+  tree id_f2 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("omp_data_size"),
+			   unsigned_type_node);
+  DECL_CHAIN (id_f2) = id_f1;
+  tree id_f3 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("kernel_dependencies_count"),
+			   unsigned_type_node);
+  DECL_CHAIN (id_f3) = id_f2;
+  tree id_f4 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("kernel_dependencies"),
+			   build_pointer_type (build_pointer_type
+					       (char_type_node)));
+  DECL_CHAIN (id_f4) = id_f3;
+  finish_builtin_struct (kernel_info_type, "__hsa_kernel_info", id_f4,
+			 NULL_TREE);
+
+  int_num_of_kernels = build_int_cstu (uint32_type_node, map_count);
+  tree kernel_info_vector_type = build_array_type
+    (kernel_info_type, build_index_type (int_num_of_kernels));
+
+  vec<constructor_elt, va_gc> *kernel_info_vector_vec = NULL;
+  tree kernel_dependencies_vector_type = NULL;
+
+  for (unsigned i = 0; i < map_count; ++i)
+    {
+      tree kernel = hsa_get_decl_kernel_mapping_decl (i);
+      char *name = hsa_get_decl_kernel_mapping_name (i);
+      unsigned len = strlen (name);
+      char *copy = XNEWVEC (char, len + 2);
+      copy[0] = '&';
+      memcpy (copy + 1, name, len);
+      copy[len + 1] = '\0';
+      len++;
+
+      tree kern_name = build_string (len, copy);
+      TREE_TYPE (kern_name) = build_array_type
+	(char_type_node, build_index_type (size_int (len)));
+      free (copy);
+
+      unsigned omp_size = hsa_get_decl_kernel_mapping_omp_size (i);
+      tree omp_data_size = build_int_cstu (uint32_type_node, omp_size);
+      unsigned count = 0;
+
+      kernel_dependencies_vector_type = build_array_type
+	(build_pointer_type (char_type_node),
+	 build_index_type (size_int (0)));
+
+      vec<constructor_elt, va_gc> *kernel_dependencies_vec = NULL;
+      if (hsa_decl_kernel_dependencies)
+	{
+	  vec<const char *> **slot;
+	  slot = hsa_decl_kernel_dependencies->get (kernel);
+	  if (slot)
+	    {
+	      vec <const char *> *dependencies = *slot;
+	      count = dependencies->length ();
+
+	      kernel_dependencies_vector_type = build_array_type
+		(build_pointer_type (char_type_node),
+		 build_index_type (size_int (count)));
+
+	      for (unsigned j = 0; j < count; j++)
+		{
+		  const char *d = (*dependencies)[j];
+		  len = strlen (d);
+		  tree dependency_name = build_string (len, d);
+		  TREE_TYPE (dependency_name) = build_array_type
+		    (char_type_node, build_index_type (size_int (len)));
+
+		  CONSTRUCTOR_APPEND_ELT
+		    (kernel_dependencies_vec, NULL_TREE,
+		     build1 (ADDR_EXPR,
+			     build_pointer_type (TREE_TYPE (dependency_name)),
+			     dependency_name));
+		}
+	    }
+	}
+
+      tree dependencies_count = build_int_cstu (uint32_type_node, count);
+
+      vec<constructor_elt, va_gc> *kernel_info_vec = NULL;
+      CONSTRUCTOR_APPEND_ELT (kernel_info_vec, NULL_TREE,
+			      build1 (ADDR_EXPR,
+				      build_pointer_type (TREE_TYPE
+							  (kern_name)),
+				      kern_name));
+      CONSTRUCTOR_APPEND_ELT (kernel_info_vec, NULL_TREE, omp_data_size);
+      CONSTRUCTOR_APPEND_ELT (kernel_info_vec, NULL_TREE, dependencies_count);
+
+      tree kernel_info_ctor = build_constructor (kernel_info_type,
+						 kernel_info_vec);
+
+      if (count > 0)
+	{
+	  ASM_GENERATE_INTERNAL_LABEL (tmp_name, "__hsa_dependencies_list", i);
+	  tree dependencies_list = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+					       get_identifier (tmp_name),
+					       kernel_dependencies_vector_type);
+
+	  TREE_STATIC (dependencies_list) = 1;
+	  TREE_READONLY (dependencies_list) = 1;
+	  TREE_PUBLIC (dependencies_list) = 0;
+	  DECL_ARTIFICIAL (dependencies_list) = 1;
+	  DECL_IGNORED_P (dependencies_list) = 1;
+	  DECL_EXTERNAL (dependencies_list) = 0;
+	  TREE_CONSTANT (dependencies_list) = 1;
+	  DECL_INITIAL (dependencies_list) = build_constructor
+	    (kernel_dependencies_vector_type, kernel_dependencies_vec);
+	  varpool_node::finalize_decl (dependencies_list);
+
+	  CONSTRUCTOR_APPEND_ELT (kernel_info_vec, NULL_TREE,
+				  build1 (ADDR_EXPR,
+					  build_pointer_type
+					    (TREE_TYPE (dependencies_list)),
+					  dependencies_list));
+	}
+      else
+	CONSTRUCTOR_APPEND_ELT (kernel_info_vec, NULL_TREE, null_pointer_node);
+
+      CONSTRUCTOR_APPEND_ELT (kernel_info_vector_vec, NULL_TREE,
+			      kernel_info_ctor);
+    }
+
+  ASM_GENERATE_INTERNAL_LABEL (tmp_name, "__hsa_kernels", 1);
+  tree hsa_kernels = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+				 get_identifier (tmp_name),
+				 kernel_info_vector_type);
+
+  TREE_STATIC (hsa_kernels) = 1;
+  TREE_READONLY (hsa_kernels) = 1;
+  TREE_PUBLIC (hsa_kernels) = 0;
+  DECL_ARTIFICIAL (hsa_kernels) = 1;
+  DECL_IGNORED_P (hsa_kernels) = 1;
+  DECL_EXTERNAL (hsa_kernels) = 0;
+  TREE_CONSTANT (hsa_kernels) = 1;
+  DECL_INITIAL (hsa_kernels) = build_constructor (kernel_info_vector_type,
+						  kernel_info_vector_vec);
+  varpool_node::finalize_decl (hsa_kernels);
+
+  tree hsa_image_desc_type = make_node (RECORD_TYPE);
+  id_f1 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("brig_module"), ptr_type_node);
+  DECL_CHAIN (id_f1) = NULL_TREE;
+  id_f2 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("kernel_count"),
+			   unsigned_type_node);
+
+  DECL_CHAIN (id_f2) = id_f1;
+  id_f3 = build_decl (BUILTINS_LOCATION, FIELD_DECL,
+			   get_identifier ("hsa_kernel_infos"),
+			   ptr_type_node);
+  DECL_CHAIN (id_f3) = id_f2;
+  finish_builtin_struct (hsa_image_desc_type, "__hsa_image_desc", id_f3,
+			 NULL_TREE);
+
+  vec<constructor_elt, va_gc> *img_desc_vec = NULL;
+  CONSTRUCTOR_APPEND_ELT (img_desc_vec, NULL_TREE,
+			  build_fold_addr_expr (brig_decl));
+  CONSTRUCTOR_APPEND_ELT (img_desc_vec, NULL_TREE,
+			  build_int_cstu (unsigned_type_node, map_count));
+  CONSTRUCTOR_APPEND_ELT (img_desc_vec, NULL_TREE,
+			  build1 (ADDR_EXPR,
+				  build_pointer_type (TREE_TYPE
+						      (hsa_kernels)),
+				  hsa_kernels));
+
+  tree img_desc_ctor = build_constructor (hsa_image_desc_type, img_desc_vec);
+
+  ASM_GENERATE_INTERNAL_LABEL (tmp_name, "__hsa_img_descriptor", 1);
+  tree hsa_img_descriptor = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+					get_identifier (tmp_name),
+					hsa_image_desc_type);
+  TREE_STATIC (hsa_img_descriptor) = 1;
+  TREE_READONLY (hsa_img_descriptor) = 1;
+  TREE_PUBLIC (hsa_img_descriptor) = 0;
+  DECL_ARTIFICIAL (hsa_img_descriptor) = 1;
+  DECL_IGNORED_P (hsa_img_descriptor) = 1;
+  DECL_EXTERNAL (hsa_img_descriptor) = 0;
+  TREE_CONSTANT (hsa_img_descriptor) = 1;
+  DECL_INITIAL (hsa_img_descriptor) = img_desc_ctor;
+  varpool_node::finalize_decl (hsa_img_descriptor);
+
+  /* Construct the "host_table" libgomp expects: the begin and end of the
+     table of host function addresses, followed by the begin and end of the
+     table of offloaded variables, which is empty here.  */
+  tree libgomp_host_table_type = build_array_type (ptr_type_node,
+						   build_index_type
+						   (build_int_cst
+						    (integer_type_node, 4)));
+  vec<constructor_elt, va_gc> *libgomp_host_table_vec = NULL;
+  tree host_func_table_addr = build_fold_addr_expr (hsa_host_func_table);
+  CONSTRUCTOR_APPEND_ELT (libgomp_host_table_vec, NULL_TREE,
+			  host_func_table_addr);
+  offset_int func_table_size = wi::to_offset (TYPE_SIZE_UNIT (ptr_type_node))
+    * map_count;
+  CONSTRUCTOR_APPEND_ELT (libgomp_host_table_vec, NULL_TREE,
+			  fold_build2 (POINTER_PLUS_EXPR,
+				       TREE_TYPE (host_func_table_addr),
+				       host_func_table_addr,
+				       build_int_cst (size_type_node,
+						      func_table_size.to_uhwi
+						      ())));
+  CONSTRUCTOR_APPEND_ELT (libgomp_host_table_vec, NULL_TREE, null_pointer_node);
+  CONSTRUCTOR_APPEND_ELT (libgomp_host_table_vec, NULL_TREE, null_pointer_node);
+  tree libgomp_host_table_ctor = build_constructor (libgomp_host_table_type,
+						    libgomp_host_table_vec);
+  ASM_GENERATE_INTERNAL_LABEL (tmp_name, "__hsa_libgomp_host_table", 1);
+  tree hsa_libgomp_host_table = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+					    get_identifier (tmp_name),
+					    libgomp_host_table_type);
+
+  TREE_STATIC (hsa_libgomp_host_table) = 1;
+  TREE_READONLY (hsa_libgomp_host_table) = 1;
+  TREE_PUBLIC (hsa_libgomp_host_table) = 0;
+  DECL_ARTIFICIAL (hsa_libgomp_host_table) = 1;
+  DECL_IGNORED_P (hsa_libgomp_host_table) = 1;
+  DECL_EXTERNAL (hsa_libgomp_host_table) = 0;
+  TREE_CONSTANT (hsa_libgomp_host_table) = 1;
+  DECL_INITIAL (hsa_libgomp_host_table) = libgomp_host_table_ctor;
+  varpool_node::finalize_decl (hsa_libgomp_host_table);
+
+  /* Generate an initializer with a call to the registration routine.  */
+
+  tree offload_register = builtin_decl_explicit
+    (BUILT_IN_GOMP_OFFLOAD_REGISTER);
+  gcc_checking_assert (offload_register);
+
+  append_to_statement_list
+    (build_call_expr (offload_register, 3,
+		      build_fold_addr_expr (hsa_libgomp_host_table),
+		      /* 7 stands for HSA.  */
+		      build_int_cst (integer_type_node, 7),
+		      build_fold_addr_expr (hsa_img_descriptor)),
+     &hsa_ctor_statements);
+
+  cgraph_build_static_cdtor ('I', hsa_ctor_statements, DEFAULT_INIT_PRIORITY);
+
+  tree offload_unregister = builtin_decl_explicit
+    (BUILT_IN_GOMP_OFFLOAD_UNREGISTER);
+  gcc_checking_assert (offload_unregister);
+
+  append_to_statement_list
+    (build_call_expr (offload_unregister,
+		      3, build_fold_addr_expr (hsa_libgomp_host_table),
+		      /* 7 stands for HSA.  */
+		      build_int_cst (integer_type_node, 7),
+		      build_fold_addr_expr (hsa_img_descriptor)),
+     &hsa_dtor_statements);
+  cgraph_build_static_cdtor ('D', hsa_dtor_statements, DEFAULT_INIT_PRIORITY);
+}
+
+/* Required HSA section alignment.  */
+
+#define HSA_SECTION_ALIGNMENT 16
+
+/* Emit the BRIG module we have compiled to a section in the final assembly
+   and also create a compilation unit static constructor that will register
+   the BRIG module with libgomp.  */
+
+void
+hsa_output_brig (void)
+{
+  section *saved_section;
+
+  if (!brig_initialized)
+    return;
+
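+  /* Now that the final code section offsets of all functions are known,
+     patch the code reference operands of all recorded call instructions.  */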
+  for (unsigned i = 0; i < function_call_linkage.length (); i++)
+    {
+      function_linkage_pair p = function_call_linkage[i];
+
+      BrigCodeOffset32_t *func_offset = function_offsets->get (p.function_decl);
+      gcc_assert (*func_offset);
+      BrigOperandCodeRef *code_ref = (BrigOperandCodeRef *)
+	(brig_operand.get_ptr_by_offset (p.offset));
+      gcc_assert (code_ref->base.kind == BRIG_KIND_OPERAND_CODE_REF);
+      code_ref->ref = htole32 (*func_offset);
+    }
+
+  /* Iterate over all function declarations and if we meet a function that
+     should have module linkage but we were unable to emit HSAIL for it,
+     change its linkage to program linkage.  This way we still emit a valid
+     BRIG image.  */
+  if (hsa_failed_functions != NULL && emitted_declarations != NULL)
+    for (hash_map <tree, BrigDirectiveExecutable *>::iterator it =
+	 emitted_declarations->begin (); it != emitted_declarations->end ();
+	 ++it)
+      {
+	if (hsa_failed_functions->contains ((*it).first))
+	  (*it).second->linkage = BRIG_LINKAGE_PROGRAM;
+      }
+
+  saved_section = in_section;
+
+  switch_to_section (get_section (BRIG_ELF_SECTION_NAME, SECTION_NOTYPE, NULL));
+  char tmp_name[64];
+  ASM_GENERATE_INTERNAL_LABEL (tmp_name, BRIG_LABEL_STRING, 1);
+  ASM_OUTPUT_LABEL (asm_out_file, tmp_name);
+  tree brig_id = get_identifier (tmp_name);
+  tree brig_decl = build_decl (UNKNOWN_LOCATION, VAR_DECL, brig_id,
+			       char_type_node);
+  SET_DECL_ASSEMBLER_NAME (brig_decl, brig_id);
+  TREE_ADDRESSABLE (brig_decl) = 1;
+  TREE_READONLY (brig_decl) = 1;
+  DECL_ARTIFICIAL (brig_decl) = 1;
+  DECL_IGNORED_P (brig_decl) = 1;
+  TREE_STATIC (brig_decl) = 1;
+  TREE_PUBLIC (brig_decl) = 0;
+  TREE_USED (brig_decl) = 1;
+  DECL_INITIAL (brig_decl) = brig_decl;
+  TREE_ASM_WRITTEN (brig_decl) = 1;
+
+  BrigModuleHeader module_header;
+  memcpy (&module_header.identification, "HSA BRIG",
+	  sizeof (module_header.identification));
+  module_header.brigMajor = htole32 (BRIG_VERSION_BRIG_MAJOR);
+  module_header.brigMinor = htole32 (BRIG_VERSION_BRIG_MINOR);
+  uint64_t section_index[3];
+
+  int data_padding, code_padding, operand_padding;
+  data_padding = HSA_SECTION_ALIGNMENT
+    - brig_data.total_size % HSA_SECTION_ALIGNMENT;
+  code_padding = HSA_SECTION_ALIGNMENT
+    - brig_code.total_size % HSA_SECTION_ALIGNMENT;
+  operand_padding = HSA_SECTION_ALIGNMENT
+    - brig_operand.total_size % HSA_SECTION_ALIGNMENT;
+
+  uint64_t module_size = sizeof (module_header) + sizeof (section_index)
+    + brig_data.total_size + data_padding
+    + brig_code.total_size + code_padding
+    + brig_operand.total_size + operand_padding;
+  gcc_assert ((module_size % 16) == 0);
+  module_header.byteCount = htole64 (module_size);
+  memset (&module_header.hash, 0, sizeof (module_header.hash));
+  module_header.reserved = 0;
+  module_header.sectionCount = htole32 (3);
+  module_header.sectionIndex = htole64 (sizeof (module_header));
+  assemble_string ((const char *) &module_header, sizeof (module_header));
+  uint64_t off = sizeof (module_header) + sizeof (section_index);
+  section_index[0] = htole64 (off);
+  off += brig_data.total_size + data_padding;
+  section_index[1] = htole64 (off);
+  off += brig_code.total_size + code_padding;
+  section_index[2] = htole64 (off);
+  assemble_string ((const char *) &section_index, sizeof (section_index));
+
+  char padding[HSA_SECTION_ALIGNMENT];
+  memset (padding, 0, sizeof (padding));
+
+  brig_data.output ();
+  assemble_string (padding, data_padding);
+  brig_code.output ();
+  assemble_string (padding, code_padding);
+  brig_operand.output ();
+  assemble_string (padding, operand_padding);
+
+  if (saved_section)
+    switch_to_section (saved_section);
+
+  hsa_output_kernel_mapping (brig_decl);
+
+  hsa_free_decl_kernel_mapping ();
+  brig_release_data ();
+  hsa_deinit_compilation_unit_data ();
+
+  delete emitted_declarations;
+  emitted_declarations = NULL;
+  delete function_offsets;
+  function_offsets = NULL;
+}
diff --git a/gcc/hsa-dump.c b/gcc/hsa-dump.c
new file mode 100644
index 0000000..83c89d1
--- /dev/null
+++ b/gcc/hsa-dump.c
@@ -0,0 +1,1127 @@
+/* Infrastructure to dump our HSAIL IL
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Martin Jambor <mjambor@suse.cz> and
+   Martin Liska <mliska@suse.cz>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "vec.h"
+#include "tree.h"
+#include "cfg.h"
+#include "function.h"
+#include "dumpfile.h"
+#include "gimple-pretty-print.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+/* Return textual name of TYPE.  */
+
+static const char *
+hsa_type_name (BrigType16_t type)
+{
+  switch (type)
+    {
+    case BRIG_TYPE_NONE:
+      return "none";
+    case BRIG_TYPE_U8:
+      return "u8";
+    case BRIG_TYPE_U16:
+      return "u16";
+    case BRIG_TYPE_U32:
+      return "u32";
+    case BRIG_TYPE_U64:
+      return "u64";
+    case BRIG_TYPE_S8:
+      return "s8";
+    case BRIG_TYPE_S16:
+      return "s16";
+    case BRIG_TYPE_S32:
+      return "s32";
+    case BRIG_TYPE_S64:
+      return "s64";
+    case BRIG_TYPE_F16:
+      return "f16";
+    case BRIG_TYPE_F32:
+      return "f32";
+    case BRIG_TYPE_F64:
+      return "f64";
+    case BRIG_TYPE_B1:
+      return "b1";
+    case BRIG_TYPE_B8:
+      return "b8";
+    case BRIG_TYPE_B16:
+      return "b16";
+    case BRIG_TYPE_B32:
+      return "b32";
+    case BRIG_TYPE_B64:
+      return "b64";
+    case BRIG_TYPE_B128:
+      return "b128";
+    case BRIG_TYPE_SAMP:
+      return "samp";
+    case BRIG_TYPE_ROIMG:
+      return "roimg";
+    case BRIG_TYPE_WOIMG:
+      return "woimg";
+    case BRIG_TYPE_RWIMG:
+      return "rwimg";
+    case BRIG_TYPE_SIG32:
+      return "sig32";
+    case BRIG_TYPE_SIG64:
+      return "sig64";
+    case BRIG_TYPE_U8X4:
+      return "u8x4";
+    case BRIG_TYPE_U8X8:
+      return "u8x8";
+    case BRIG_TYPE_U8X16:
+      return "u8x16";
+    case BRIG_TYPE_U16X2:
+      return "u16x2";
+    case BRIG_TYPE_U16X4:
+      return "u16x4";
+    case BRIG_TYPE_U16X8:
+      return "u16x8";
+    case BRIG_TYPE_U32X2:
+      return "u32x2";
+    case BRIG_TYPE_U32X4:
+      return "u32x4";
+    case BRIG_TYPE_U64X2:
+      return "u64x2";
+    case BRIG_TYPE_S8X4:
+      return "s8x4";
+    case BRIG_TYPE_S8X8:
+      return "s8x8";
+    case BRIG_TYPE_S8X16:
+      return "s8x16";
+    case BRIG_TYPE_S16X2:
+      return "s16x2";
+    case BRIG_TYPE_S16X4:
+      return "s16x4";
+    case BRIG_TYPE_S16X8:
+      return "s16x8";
+    case BRIG_TYPE_S32X2:
+      return "s32x2";
+    case BRIG_TYPE_S32X4:
+      return "s32x4";
+    case BRIG_TYPE_S64X2:
+      return "s64x2";
+    case BRIG_TYPE_F16X2:
+      return "f16x2";
+    case BRIG_TYPE_F16X4:
+      return "f16x4";
+    case BRIG_TYPE_F16X8:
+      return "f16x8";
+    case BRIG_TYPE_F32X2:
+      return "f32x2";
+    case BRIG_TYPE_F32X4:
+      return "f32x4";
+    case BRIG_TYPE_F64X2:
+      return "f64x2";
+    default:
+      return "UNKNOWN_TYPE";
+    }
+}
+
+/* Return textual name of OPCODE.  */
+
+static const char *
+hsa_opcode_name (BrigOpcode16_t opcode)
+{
+  switch (opcode)
+    {
+    case BRIG_OPCODE_NOP:
+      return "nop";
+    case BRIG_OPCODE_ABS:
+      return "abs";
+    case BRIG_OPCODE_ADD:
+      return "add";
+    case BRIG_OPCODE_BORROW:
+      return "borrow";
+    case BRIG_OPCODE_CARRY:
+      return "carry";
+    case BRIG_OPCODE_CEIL:
+      return "ceil";
+    case BRIG_OPCODE_COPYSIGN:
+      return "copysign";
+    case BRIG_OPCODE_DIV:
+      return "div";
+    case BRIG_OPCODE_FLOOR:
+      return "floor";
+    case BRIG_OPCODE_FMA:
+      return "fma";
+    case BRIG_OPCODE_FRACT:
+      return "fract";
+    case BRIG_OPCODE_MAD:
+      return "mad";
+    case BRIG_OPCODE_MAX:
+      return "max";
+    case BRIG_OPCODE_MIN:
+      return "min";
+    case BRIG_OPCODE_MUL:
+      return "mul";
+    case BRIG_OPCODE_MULHI:
+      return "mulhi";
+    case BRIG_OPCODE_NEG:
+      return "neg";
+    case BRIG_OPCODE_REM:
+      return "rem";
+    case BRIG_OPCODE_RINT:
+      return "rint";
+    case BRIG_OPCODE_SQRT:
+      return "sqrt";
+    case BRIG_OPCODE_SUB:
+      return "sub";
+    case BRIG_OPCODE_TRUNC:
+      return "trunc";
+    case BRIG_OPCODE_MAD24:
+      return "mad24";
+    case BRIG_OPCODE_MAD24HI:
+      return "mad24hi";
+    case BRIG_OPCODE_MUL24:
+      return "mul24";
+    case BRIG_OPCODE_MUL24HI:
+      return "mul24hi";
+    case BRIG_OPCODE_SHL:
+      return "shl";
+    case BRIG_OPCODE_SHR:
+      return "shr";
+    case BRIG_OPCODE_AND:
+      return "and";
+    case BRIG_OPCODE_NOT:
+      return "not";
+    case BRIG_OPCODE_OR:
+      return "or";
+    case BRIG_OPCODE_POPCOUNT:
+      return "popcount";
+    case BRIG_OPCODE_XOR:
+      return "xor";
+    case BRIG_OPCODE_BITEXTRACT:
+      return "bitextract";
+    case BRIG_OPCODE_BITINSERT:
+      return "bitinsert";
+    case BRIG_OPCODE_BITMASK:
+      return "bitmask";
+    case BRIG_OPCODE_BITREV:
+      return "bitrev";
+    case BRIG_OPCODE_BITSELECT:
+      return "bitselect";
+    case BRIG_OPCODE_FIRSTBIT:
+      return "firstbit";
+    case BRIG_OPCODE_LASTBIT:
+      return "lastbit";
+    case BRIG_OPCODE_COMBINE:
+      return "combine";
+    case BRIG_OPCODE_EXPAND:
+      return "expand";
+    case BRIG_OPCODE_LDA:
+      return "lda";
+    case BRIG_OPCODE_MOV:
+      return "mov";
+    case BRIG_OPCODE_SHUFFLE:
+      return "shuffle";
+    case BRIG_OPCODE_UNPACKHI:
+      return "unpackhi";
+    case BRIG_OPCODE_UNPACKLO:
+      return "unpacklo";
+    case BRIG_OPCODE_PACK:
+      return "pack";
+    case BRIG_OPCODE_UNPACK:
+      return "unpack";
+    case BRIG_OPCODE_CMOV:
+      return "cmov";
+    case BRIG_OPCODE_CLASS:
+      return "class";
+    case BRIG_OPCODE_NCOS:
+      return "ncos";
+    case BRIG_OPCODE_NEXP2:
+      return "nexp2";
+    case BRIG_OPCODE_NFMA:
+      return "nfma";
+    case BRIG_OPCODE_NLOG2:
+      return "nlog2";
+    case BRIG_OPCODE_NRCP:
+      return "nrcp";
+    case BRIG_OPCODE_NRSQRT:
+      return "nrsqrt";
+    case BRIG_OPCODE_NSIN:
+      return "nsin";
+    case BRIG_OPCODE_NSQRT:
+      return "nsqrt";
+    case BRIG_OPCODE_BITALIGN:
+      return "bitalign";
+    case BRIG_OPCODE_BYTEALIGN:
+      return "bytealign";
+    case BRIG_OPCODE_PACKCVT:
+      return "packcvt";
+    case BRIG_OPCODE_UNPACKCVT:
+      return "unpackcvt";
+    case BRIG_OPCODE_LERP:
+      return "lerp";
+    case BRIG_OPCODE_SAD:
+      return "sad";
+    case BRIG_OPCODE_SADHI:
+      return "sadhi";
+    case BRIG_OPCODE_SEGMENTP:
+      return "segmentp";
+    case BRIG_OPCODE_FTOS:
+      return "ftos";
+    case BRIG_OPCODE_STOF:
+      return "stof";
+    case BRIG_OPCODE_CMP:
+      return "cmp";
+    case BRIG_OPCODE_CVT:
+      return "cvt";
+    case BRIG_OPCODE_LD:
+      return "ld";
+    case BRIG_OPCODE_ST:
+      return "st";
+    case BRIG_OPCODE_ATOMIC:
+      return "atomic";
+    case BRIG_OPCODE_ATOMICNORET:
+      return "atomicnoret";
+    case BRIG_OPCODE_SIGNAL:
+      return "signal";
+    case BRIG_OPCODE_SIGNALNORET:
+      return "signalnoret";
+    case BRIG_OPCODE_MEMFENCE:
+      return "memfence";
+    case BRIG_OPCODE_RDIMAGE:
+      return "rdimage";
+    case BRIG_OPCODE_LDIMAGE:
+      return "ldimage";
+    case BRIG_OPCODE_STIMAGE:
+      return "stimage";
+    case BRIG_OPCODE_QUERYIMAGE:
+      return "queryimage";
+    case BRIG_OPCODE_QUERYSAMPLER:
+      return "querysampler";
+    case BRIG_OPCODE_CBR:
+      return "cbr";
+    case BRIG_OPCODE_BR:
+      return "br";
+    case BRIG_OPCODE_SBR:
+      return "sbr";
+    case BRIG_OPCODE_BARRIER:
+      return "barrier";
+    case BRIG_OPCODE_WAVEBARRIER:
+      return "wavebarrier";
+    case BRIG_OPCODE_ARRIVEFBAR:
+      return "arrivefbar";
+    case BRIG_OPCODE_INITFBAR:
+      return "initfbar";
+    case BRIG_OPCODE_JOINFBAR:
+      return "joinfbar";
+    case BRIG_OPCODE_LEAVEFBAR:
+      return "leavefbar";
+    case BRIG_OPCODE_RELEASEFBAR:
+      return "releasefbar";
+    case BRIG_OPCODE_WAITFBAR:
+      return "waitfbar";
+    case BRIG_OPCODE_LDF:
+      return "ldf";
+    case BRIG_OPCODE_ACTIVELANECOUNT:
+      return "activelanecount";
+    case BRIG_OPCODE_ACTIVELANEID:
+      return "activelaneid";
+    case BRIG_OPCODE_ACTIVELANEMASK:
+      return "activelanemask";
+    case BRIG_OPCODE_CALL:
+      return "call";
+    case BRIG_OPCODE_SCALL:
+      return "scall";
+    case BRIG_OPCODE_ICALL:
+      return "icall";
+    case BRIG_OPCODE_RET:
+      return "ret";
+    case BRIG_OPCODE_ALLOCA:
+      return "alloca";
+    case BRIG_OPCODE_CURRENTWORKGROUPSIZE:
+      return "currentworkgroupsize";
+    case BRIG_OPCODE_DIM:
+      return "dim";
+    case BRIG_OPCODE_GRIDGROUPS:
+      return "gridgroups";
+    case BRIG_OPCODE_GRIDSIZE:
+      return "gridsize";
+    case BRIG_OPCODE_PACKETCOMPLETIONSIG:
+      return "packetcompletionsig";
+    case BRIG_OPCODE_PACKETID:
+      return "packetid";
+    case BRIG_OPCODE_WORKGROUPID:
+      return "workgroupid";
+    case BRIG_OPCODE_WORKGROUPSIZE:
+      return "workgroupsize";
+    case BRIG_OPCODE_WORKITEMABSID:
+      return "workitemabsid";
+    case BRIG_OPCODE_WORKITEMFLATABSID:
+      return "workitemflatabsid";
+    case BRIG_OPCODE_WORKITEMFLATID:
+      return "workitemflatid";
+    case BRIG_OPCODE_WORKITEMID:
+      return "workitemid";
+    case BRIG_OPCODE_CLEARDETECTEXCEPT:
+      return "cleardetectexcept";
+    case BRIG_OPCODE_GETDETECTEXCEPT:
+      return "getdetectexcept";
+    case BRIG_OPCODE_SETDETECTEXCEPT:
+      return "setdetectexcept";
+    case BRIG_OPCODE_ADDQUEUEWRITEINDEX:
+      return "addqueuewriteindex";
+    case BRIG_OPCODE_CASQUEUEWRITEINDEX:
+      return "casqueuewriteindex";
+    case BRIG_OPCODE_LDQUEUEREADINDEX:
+      return "ldqueuereadindex";
+    case BRIG_OPCODE_LDQUEUEWRITEINDEX:
+      return "ldqueuewriteindex";
+    case BRIG_OPCODE_STQUEUEREADINDEX:
+      return "stqueuereadindex";
+    case BRIG_OPCODE_STQUEUEWRITEINDEX:
+      return "stqueuewriteindex";
+    case BRIG_OPCODE_CLOCK:
+      return "clock";
+    case BRIG_OPCODE_CUID:
+      return "cuid";
+    case BRIG_OPCODE_DEBUGTRAP:
+      return "debugtrap";
+    case BRIG_OPCODE_GROUPBASEPTR:
+      return "groupbaseptr";
+    case BRIG_OPCODE_KERNARGBASEPTR:
+      return "kernargbaseptr";
+    case BRIG_OPCODE_LANEID:
+      return "laneid";
+    case BRIG_OPCODE_MAXCUID:
+      return "maxcuid";
+    case BRIG_OPCODE_MAXWAVEID:
+      return "maxwaveid";
+    case BRIG_OPCODE_NULLPTR:
+      return "nullptr";
+    case BRIG_OPCODE_WAVEID:
+      return "waveid";
+    default:
+      return "UNKNOWN_OPCODE";
+    }
+}
+
+/* Return textual name of SEG.  */
+
+const char *
+hsa_seg_name (BrigSegment8_t seg)
+{
+  switch (seg)
+    {
+    case BRIG_SEGMENT_NONE:
+      return "none";
+    case BRIG_SEGMENT_FLAT:
+      return "flat";
+    case BRIG_SEGMENT_GLOBAL:
+      return "global";
+    case BRIG_SEGMENT_READONLY:
+      return "readonly";
+    case BRIG_SEGMENT_KERNARG:
+      return "kernarg";
+    case BRIG_SEGMENT_GROUP:
+      return "group";
+    case BRIG_SEGMENT_PRIVATE:
+      return "private";
+    case BRIG_SEGMENT_SPILL:
+      return "spill";
+    case BRIG_SEGMENT_ARG:
+      return "arg";
+    default:
+      return "UNKNOWN_SEGMENT";
+    }
+}
+
+/* Return textual name of CMPOP.  */
+
+static const char *
+hsa_cmpop_name (BrigCompareOperation8_t cmpop)
+{
+  switch (cmpop)
+    {
+    case BRIG_COMPARE_EQ:
+      return "eq";
+    case BRIG_COMPARE_NE:
+      return "ne";
+    case BRIG_COMPARE_LT:
+      return "lt";
+    case BRIG_COMPARE_LE:
+      return "le";
+    case BRIG_COMPARE_GT:
+      return "gt";
+    case BRIG_COMPARE_GE:
+      return "ge";
+    case BRIG_COMPARE_EQU:
+      return "equ";
+    case BRIG_COMPARE_NEU:
+      return "neu";
+    case BRIG_COMPARE_LTU:
+      return "ltu";
+    case BRIG_COMPARE_LEU:
+      return "leu";
+    case BRIG_COMPARE_GTU:
+      return "gtu";
+    case BRIG_COMPARE_GEU:
+      return "geu";
+    case BRIG_COMPARE_NUM:
+      return "num";
+    case BRIG_COMPARE_NAN:
+      return "nan";
+    case BRIG_COMPARE_SEQ:
+      return "seq";
+    case BRIG_COMPARE_SNE:
+      return "sne";
+    case BRIG_COMPARE_SLT:
+      return "slt";
+    case BRIG_COMPARE_SLE:
+      return "sle";
+    case BRIG_COMPARE_SGT:
+      return "sgt";
+    case BRIG_COMPARE_SGE:
+      return "sge";
+    case BRIG_COMPARE_SGEU:
+      return "sgeu";
+    case BRIG_COMPARE_SEQU:
+      return "sequ";
+    case BRIG_COMPARE_SNEU:
+      return "sneu";
+    case BRIG_COMPARE_SLTU:
+      return "sltu";
+    case BRIG_COMPARE_SLEU:
+      return "sleu";
+    case BRIG_COMPARE_SNUM:
+      return "snum";
+    case BRIG_COMPARE_SNAN:
+      return "snan";
+    case BRIG_COMPARE_SGTU:
+      return "sgtu";
+    default:
+      return "UNKNOWN_COMPARISON";
+    }
+}
+
+/* Return textual name for memory order.  */
+
+static const char *
+hsa_memsem_name (enum BrigMemoryOrder mo)
+{
+  switch (mo)
+    {
+    case BRIG_MEMORY_ORDER_NONE:
+      return "";
+    case BRIG_MEMORY_ORDER_RELAXED:
+      return "rlx";
+    case BRIG_MEMORY_ORDER_SC_ACQUIRE:
+      return "scacq";
+    case BRIG_MEMORY_ORDER_SC_RELEASE:
+      return "screl";
+    case BRIG_MEMORY_ORDER_SC_ACQUIRE_RELEASE:
+      return "scar";
+    default:
+      return "UNKNOWN_MEMORY_ORDER";
+    }
+}
+
+/* Return textual name for memory scope.  */
+
+static const char *
+hsa_memscope_name (enum BrigMemoryScope scope)
+{
+  switch (scope)
+    {
+    case BRIG_MEMORY_SCOPE_NONE:
+      return "";
+    case BRIG_MEMORY_SCOPE_WORKITEM:
+      return "wi";
+    case BRIG_MEMORY_SCOPE_WAVEFRONT:
+      return "wave";
+    case BRIG_MEMORY_SCOPE_WORKGROUP:
+      return "wg";
+    case BRIG_MEMORY_SCOPE_AGENT:
+      return "agent";
+    case BRIG_MEMORY_SCOPE_SYSTEM:
+      return "sys";
+    default:
+      return "UNKNOWN_SCOPE";
+    }
+}
+
+/* Return textual name for atomic operation.  */
+
+static const char *
+hsa_m_atomicop_name (enum BrigAtomicOperation op)
+{
+  switch (op)
+    {
+    case BRIG_ATOMIC_ADD:
+      return "add";
+    case BRIG_ATOMIC_AND:
+      return "and";
+    case BRIG_ATOMIC_CAS:
+      return "cas";
+    case BRIG_ATOMIC_EXCH:
+      return "exch";
+    case BRIG_ATOMIC_LD:
+      return "ld";
+    case BRIG_ATOMIC_MAX:
+      return "max";
+    case BRIG_ATOMIC_MIN:
+      return "min";
+    case BRIG_ATOMIC_OR:
+      return "or";
+    case BRIG_ATOMIC_ST:
+      return "st";
+    case BRIG_ATOMIC_SUB:
+      return "sub";
+    case BRIG_ATOMIC_WRAPDEC:
+      return "wrapdec";
+    case BRIG_ATOMIC_WRAPINC:
+      return "wrapinc";
+    case BRIG_ATOMIC_XOR:
+      return "xor";
+    case BRIG_ATOMIC_WAIT_EQ:
+      return "wait_eq";
+    case BRIG_ATOMIC_WAIT_NE:
+      return "wait_ne";
+    case BRIG_ATOMIC_WAIT_LT:
+      return "wait_lt";
+    case BRIG_ATOMIC_WAIT_GTE:
+      return "wait_gte";
+    case BRIG_ATOMIC_WAITTIMEOUT_EQ:
+      return "waittimeout_eq";
+    case BRIG_ATOMIC_WAITTIMEOUT_NE:
+      return "waittimeout_ne";
+    case BRIG_ATOMIC_WAITTIMEOUT_LT:
+      return "waittimeout_lt";
+    case BRIG_ATOMIC_WAITTIMEOUT_GTE:
+      return "waittimeout_gte";
+    default:
+      return "UNKNOWN_ATOMIC_OP";
+    }
+}
+
+/* Dump textual representation of HSA IL register REG to file F.  */
+
+static void
+dump_hsa_reg (FILE *f, hsa_op_reg *reg, bool dump_type = false)
+{
+  if (reg->m_reg_class)
+    fprintf (f, "$%c%i", reg->m_reg_class, reg->m_hard_num);
+  else
+    fprintf (f, "$_%i", reg->m_order);
+  if (dump_type)
+    fprintf (f, " (%s)", hsa_type_name (reg->m_type));
+}
+
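+/* For example, a hard register prints as "$s2" while a pseudo register that
+   has not been allocated yet prints as "$_15"; with DUMP_TYPE set, the type
+   follows in parentheses, e.g. "$_15 (u32)".  */
+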
+/* Dump textual representation of HSA IL immediate operand IMM to file F.  */
+
+static void
+dump_hsa_immed (FILE *f, hsa_op_immed *imm)
+{
+  /* BRIG type values are consecutive numbers, not bit flags, so test for
+     equality rather than ORing them together.  */
+  bool unsigned_int_type = (imm->m_type == BRIG_TYPE_U8
+			    || imm->m_type == BRIG_TYPE_U16
+			    || imm->m_type == BRIG_TYPE_U32
+			    || imm->m_type == BRIG_TYPE_U64);
+
+  if (imm->m_tree_value)
+    print_generic_expr (f, imm->m_tree_value, 0);
+  else
+    {
+      gcc_checking_assert (imm->m_brig_repr_size <= 8);
+
+      if (unsigned_int_type)
+	fprintf (f, HOST_WIDE_INT_PRINT_UNSIGNED,
+		 (unsigned HOST_WIDE_INT) imm->m_int_value);
+      else
+	fprintf (f, HOST_WIDE_INT_PRINT_DEC, imm->m_int_value);
+    }
+
+  fprintf (f, " (%s)", hsa_type_name (imm->m_type));
+}
+
+/* Dump textual representation of HSA IL address operand ADDR to file F.  */
+
+static void
+dump_hsa_address (FILE *f, hsa_op_address *addr)
+{
+  bool sth = false;
+
+  if (addr->m_symbol)
+    {
+      sth = true;
+      if (addr->m_symbol->m_name)
+	fprintf (f, "[%%%s]", addr->m_symbol->m_name);
+      else
+	fprintf (f, "[%%__%s_%i]", hsa_seg_name (addr->m_symbol->m_segment),
+		 addr->m_symbol->m_name_number);
+    }
+
+  if (addr->m_reg)
+    {
+      fprintf (f, "[");
+      dump_hsa_reg (f, addr->m_reg);
+      if (addr->m_imm_offset != 0)
+	fprintf (f, " + " HOST_WIDE_INT_PRINT_DEC "]", addr->m_imm_offset);
+      else
+	fprintf (f, "]");
+    }
+  else if (!sth || addr->m_imm_offset != 0)
+    fprintf (f, "[" HOST_WIDE_INT_PRINT_DEC "]", addr->m_imm_offset);
+}
+
+/* Dump textual representation of HSA IL symbol SYMBOL to file F.  */
+
+static void
+dump_hsa_symbol (FILE *f, hsa_symbol *symbol)
+{
+  const char *name;
+  char buf[64];
+  if (symbol->m_name)
+    name = symbol->m_name;
+  else
+    {
+      /* BUF is declared in the enclosing scope so that NAME does not point
+	 to a dead object when used below.  */
+      sprintf (buf, "__%s_%i", hsa_seg_name (symbol->m_segment),
+	       symbol->m_name_number);
+      name = buf;
+    }
+
+  fprintf (f, "%s (%s)", name, hsa_type_name (symbol->m_type));
+}
+
+/* Dump textual representation of HSA IL operand OP to file F.  */
+
+static void
+dump_hsa_operand (FILE *f, hsa_op_base *op, bool dump_reg_type = false)
+{
+  if (is_a <hsa_op_immed *> (op))
+    dump_hsa_immed (f, as_a <hsa_op_immed *> (op));
+  else if (is_a <hsa_op_reg *> (op))
+    dump_hsa_reg (f, as_a <hsa_op_reg *> (op), dump_reg_type);
+  else if (is_a <hsa_op_address *> (op))
+    dump_hsa_address (f, as_a <hsa_op_address *> (op));
+  else
+    fprintf (f, "UNKNOWN_OP_KIND");
+}
+
+/* Dump textual representation of HSA IL operands in VEC to file F.  */
+
+static void
+dump_hsa_operands (FILE *f, hsa_insn_basic *insn, int start = 0,
+		   int end = -1, bool dump_reg_type = false)
+{
+  if (end == -1)
+    end = insn->operand_count ();
+
+  for (int i = start; i < end; i++)
+    {
+      dump_hsa_operand (f, insn->get_op (i), dump_reg_type);
+      if (i != end - 1)
+	fprintf (f, ", ");
+    }
+}
+
+/* Indent F stream with INDENT spaces.  */
+
+static void
+indent_stream (FILE *f, int indent)
+{
+  for (int i = 0; i < indent; i++)
+    fputc (' ', f);
+}
+
+/* Dump textual representation of HSA IL instruction INSN to file F.  Prepend
+   the instruction with *INDENT spaces and adjust the indentation for call
+   instructions as appropriate.  */
+
+static void
+dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
+{
+  gcc_checking_assert (insn);
+
+  if (insn->m_number)
+    fprintf (f, "%5d: ", insn->m_number);
+
+  indent_stream (f, *indent);
+
+  if (is_a <hsa_insn_phi *> (insn))
+    {
+      hsa_insn_phi *phi = as_a <hsa_insn_phi *> (insn);
+      bool first = true;
+      dump_hsa_reg (f, phi->m_dest, true);
+      fprintf (f, " = PHI <");
+      unsigned count = phi->operand_count ();
+      for (unsigned i = 0; i < count; i++)
+	{
+	  if (!phi->get_op (i))
+	    break;
+	  if (!first)
+	    fprintf (f, ", ");
+	  else
+	    first = false;
+	  dump_hsa_operand (f, phi->get_op (i), true);
+	}
+      fprintf (f, ">");
+    }
+  else if (is_a <hsa_insn_signal *> (insn))
+    {
+      hsa_insn_signal *mem = as_a <hsa_insn_signal *> (insn);
+
+      fprintf (f, "%s", hsa_opcode_name (mem->m_opcode));
+      fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_atomicop));
+      if (mem->m_memoryorder != BRIG_MEMORY_ORDER_NONE)
+	fprintf (f, "_%s", hsa_memsem_name (mem->m_memoryorder));
+      fprintf (f, "_%s ", hsa_type_name (mem->m_type));
+
+      dump_hsa_operands (f, mem);
+    }
+
+  else if (is_a <hsa_insn_atomic *> (insn))
+    {
+      hsa_insn_atomic *mem = as_a <hsa_insn_atomic *> (insn);
+
+      /* Either operand[0] or operand[1] must be an address operand.  */
+      hsa_op_address *addr = NULL;
+      if (is_a <hsa_op_address *> (mem->get_op (0)))
+	addr = as_a <hsa_op_address *> (mem->get_op (0));
+      else
+	addr = as_a <hsa_op_address *> (mem->get_op (1));
+
+      fprintf (f, "%s", hsa_opcode_name (mem->m_opcode));
+      fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_atomicop));
+      if (addr->m_symbol)
+	fprintf (f, "_%s", hsa_seg_name (addr->m_symbol->m_segment));
+      if (mem->m_memoryorder != BRIG_MEMORY_ORDER_NONE)
+	fprintf (f, "_%s", hsa_memsem_name (mem->m_memoryorder));
+      if (mem->m_memoryscope != BRIG_MEMORY_SCOPE_NONE)
+	fprintf (f, "_%s", hsa_memscope_name (mem->m_memoryscope));
+      fprintf (f, "_%s ", hsa_type_name (mem->m_type));
+
+      dump_hsa_operands (f, mem);
+    }
+  else if (is_a <hsa_insn_mem *> (insn))
+    {
+      hsa_insn_mem *mem = as_a <hsa_insn_mem *> (insn);
+      hsa_op_address *addr = as_a <hsa_op_address *> (mem->get_op (1));
+
+      fprintf (f, "%s", hsa_opcode_name (mem->m_opcode));
+      if (addr->m_symbol)
+	fprintf (f, "_%s", hsa_seg_name (addr->m_symbol->m_segment));
+      if (mem->m_equiv_class != 0)
+	fprintf (f, "_equiv(%i)", mem->m_equiv_class);
+      fprintf (f, "_%s ", hsa_type_name (mem->m_type));
+
+      dump_hsa_operand (f, mem->get_op (0));
+      fprintf (f, ", ");
+      dump_hsa_address (f, addr);
+    }
+  else if (insn->m_opcode == BRIG_OPCODE_LDA)
+    {
+      hsa_op_address *addr = as_a <hsa_op_address *> (insn->get_op (1));
+
+      fprintf (f, "%s", hsa_opcode_name (insn->m_opcode));
+      if (addr->m_symbol)
+	fprintf (f, "_%s", hsa_seg_name (addr->m_symbol->m_segment));
+      fprintf (f, "_%s ", hsa_type_name (insn->m_type));
+
+      dump_hsa_operand (f, insn->get_op (0));
+      fprintf (f, ", ");
+      dump_hsa_address (f, addr);
+    }
+  else if (is_a <hsa_insn_seg *> (insn))
+    {
+      hsa_insn_seg *seg = as_a <hsa_insn_seg *> (insn);
+      fprintf (f, "%s_%s_%s_%s ", hsa_opcode_name (seg->m_opcode),
+	       hsa_seg_name (seg->m_segment),
+	       hsa_type_name (seg->m_type), hsa_type_name (seg->m_src_type));
+      dump_hsa_reg (f, as_a <hsa_op_reg *> (seg->get_op (0)));
+      fprintf (f, ", ");
+      dump_hsa_operand (f, seg->get_op (1));
+    }
+  else if (is_a <hsa_insn_cmp *> (insn))
+    {
+      hsa_insn_cmp *cmp = as_a <hsa_insn_cmp *> (insn);
+      BrigType16_t src_type;
+
+      if (is_a <hsa_op_reg *> (cmp->get_op (1)))
+	src_type = as_a <hsa_op_reg *> (cmp->get_op (1))->m_type;
+      else
+	src_type = as_a <hsa_op_immed *> (cmp->get_op (1))->m_type;
+
+      fprintf (f, "%s_%s_%s_%s ", hsa_opcode_name (cmp->m_opcode),
+	       hsa_cmpop_name (cmp->m_compare),
+	       hsa_type_name (cmp->m_type), hsa_type_name (src_type));
+      dump_hsa_reg (f, as_a <hsa_op_reg *> (cmp->get_op (0)));
+      fprintf (f, ", ");
+      dump_hsa_operand (f, cmp->get_op (1));
+      fprintf (f, ", ");
+      dump_hsa_operand (f, cmp->get_op (2));
+    }
+  else if (is_a <hsa_insn_br *> (insn))
+    {
+      hsa_insn_br *br = as_a <hsa_insn_br *> (insn);
+      basic_block target = NULL;
+      edge_iterator ei;
+      edge e;
+
+      fprintf (f, "%s ", hsa_opcode_name (br->m_opcode));
+      if (br->m_opcode == BRIG_OPCODE_CBR)
+	{
+	  dump_hsa_reg (f, as_a <hsa_op_reg *> (br->get_op (0)));
+	  fprintf (f, ", ");
+	}
+
+      FOR_EACH_EDGE (e, ei, br->m_bb->succs)
+	if (e->flags & EDGE_TRUE_VALUE)
+	  {
+	    target = e->dest;
+	    break;
+	  }
+      fprintf (f, "BB %i", hsa_bb_for_bb (target)->m_index);
+    }
+  else if (is_a <hsa_insn_sbr *> (insn))
+    {
+      hsa_insn_sbr *sbr = as_a <hsa_insn_sbr *> (insn);
+
+      fprintf (f, "%s ", hsa_opcode_name (sbr->m_opcode));
+      dump_hsa_reg (f, as_a <hsa_op_reg *> (sbr->get_op (0)));
+      fprintf (f, ", [");
+
+      for (unsigned i = 0; i < sbr->m_jump_table.length (); i++)
+	{
+	  fprintf (f, "BB %i", hsa_bb_for_bb (sbr->m_jump_table[i])->m_index);
+	  if (i != sbr->m_jump_table.length () - 1)
+	    fprintf (f, ", ");
+	}
+
+      fprintf (f, "]");
+    }
+  else if (is_a <hsa_insn_arg_block *> (insn))
+    {
+      hsa_insn_arg_block *arg_block = as_a <hsa_insn_arg_block *> (insn);
+      bool start_p = arg_block->m_kind == BRIG_KIND_DIRECTIVE_ARG_BLOCK_START;
+      char c = start_p ? '{' : '}';
+
+      if (start_p)
+	{
+	  *indent += 2;
+	  indent_stream (f, 2);
+	}
+
+      if (!start_p)
+	*indent -= 2;
+
+      fprintf (f, "%c", c);
+    }
+  else if (is_a <hsa_insn_call *> (insn))
+    {
+      hsa_insn_call *call = as_a <hsa_insn_call *> (insn);
+      const char *name = hsa_get_declaration_name (call->m_called_function);
+
+      fprintf (f, "call &%s", name);
+
+      if (call->m_output_arg)
+	fprintf (f, "(%%res) ");
+
+      fprintf (f, "(");
+      for (unsigned i = 0; i < call->m_input_args.length (); i++)
+        {
+	  fprintf (f, "%%__arg_%u", i);
+
+	  if (i != call->m_input_args.length () - 1)
+	    fprintf (f, ", ");
+	}
+      fprintf (f, ")");
+    }
+  else if (is_a <hsa_insn_comment *> (insn))
+    {
+      hsa_insn_comment *c = as_a <hsa_insn_comment *> (insn);
+      fprintf (f, "%s", c->m_comment);
+    }
+  else if (is_a <hsa_insn_packed *> (insn))
+    {
+      hsa_insn_packed *packed = as_a <hsa_insn_packed *> (insn);
+
+      fprintf (f, "%s_v%u_%s_%s ", hsa_opcode_name (packed->m_opcode),
+	       packed->operand_count () - 1,
+	       hsa_type_name (packed->m_type),
+	       hsa_type_name (packed->m_source_type));
+
+      if (packed->m_opcode == BRIG_OPCODE_COMBINE)
+	{
+	  dump_hsa_operand (f, insn->get_op (0));
+	  fprintf (f, ", (");
+	  dump_hsa_operands (f, insn, 1);
+	  fprintf (f, ")");
+	}
+      else if (packed->m_opcode == BRIG_OPCODE_EXPAND)
+	{
+	  fprintf (f, "(");
+	  dump_hsa_operands (f, insn, 0, insn->operand_count () - 1);
+	  fprintf (f, "), ");
+	  dump_hsa_operand (f, insn->get_op (insn->operand_count () - 1));
+
+	}
+      else
+	gcc_unreachable ();
+    }
+  else
+    {
+      fprintf (f, "%s_%s ", hsa_opcode_name (insn->m_opcode),
+	       hsa_type_name (insn->m_type));
+
+      dump_hsa_operands (f, insn);
+    }
+
+  if (insn->m_brig_offset)
+    {
+      fprintf (f, "             /* BRIG offset: %u", insn->m_brig_offset);
+
+      for (unsigned i = 0; i < insn->operand_count (); i++)
+	fprintf (f, ", op%u: %u", i, insn->get_op (i)->m_brig_op_offset);
+
+      fprintf (f, " */");
+    }
+
+  fprintf (f, "\n");
+}
+
+/* Dump textual representation of HSA IL instruction INSN to file F.  */
+
+void
+dump_hsa_insn (FILE *f, hsa_insn_basic *insn)
+{
+  int indent = 0;
+  dump_hsa_insn_1 (f, insn, &indent);
+}
+
+/* Dump textual representation of HSA IL in HBB to file F.  */
+
+void
+dump_hsa_bb (FILE *f, hsa_bb *hbb)
+{
+  hsa_insn_basic *insn;
+  edge_iterator ei;
+  edge e;
+  basic_block true_bb = NULL, other = NULL;
+
+  fprintf (f, "BB %i:\n", hbb->m_index);
+
+  int indent = 2;
+  for (insn = hbb->m_first_phi; insn; insn = insn->m_next)
+    dump_hsa_insn_1 (f, insn, &indent);
+
+  for (insn = hbb->m_first_insn; insn; insn = insn->m_next)
+    dump_hsa_insn_1 (f, insn, &indent);
+
+  if (hbb->m_last_insn && is_a <hsa_insn_sbr *> (hbb->m_last_insn))
+    goto exit;
+
+  FOR_EACH_EDGE (e, ei, hbb->m_bb->succs)
+    if (e->flags & EDGE_TRUE_VALUE)
+      {
+	gcc_assert (!true_bb);
+	true_bb = e->dest;
+      }
+    else
+      {
+	gcc_assert (!other);
+	other = e->dest;
+      }
+
+  if (true_bb)
+    {
+      if (!hbb->m_last_insn
+	  || hbb->m_last_insn->m_opcode != BRIG_OPCODE_CBR)
+	fprintf (f, "WARNING: No branch insn for a true edge.\n");
+    }
+  else if (hbb->m_last_insn
+	   && hbb->m_last_insn->m_opcode == BRIG_OPCODE_CBR)
+    fprintf (f, "WARNING: No true edge for a cbr statement\n");
+
+  if (other && other->aux)
+    fprintf (f, "  Fall-through to BB %i\n",
+	     hsa_bb_for_bb (other)->m_index);
+  else if (hbb->m_last_insn
+	   && hbb->m_last_insn->m_opcode != BRIG_OPCODE_RET)
+    fprintf (f, "  WARNING: Fall through to a BB with no aux!\n");
+
+exit:
+  fprintf (f, "\n");
+}
+
+/* Dump textual representation of HSA IL of the current function to file F.  */
+
+void
+dump_hsa_cfun (FILE *f)
+{
+  basic_block bb;
+
+  fprintf (f, "\nHSAIL IL for %s\n", hsa_cfun->m_name);
+
+  FOR_ALL_BB_FN (bb, cfun)
+  {
+    hsa_bb *hbb = (struct hsa_bb *) bb->aux;
+    dump_hsa_bb (f, hbb);
+  }
+}
+
+/* Dump textual representation of HSA IL instruction INSN to stderr.  */
+
+DEBUG_FUNCTION void
+debug_hsa_insn (hsa_insn_basic *insn)
+{
+  dump_hsa_insn (stderr, insn);
+}
+
+/* Dump textual representation of HSA IL in HBB to stderr.  */
+
+DEBUG_FUNCTION void
+debug_hsa_bb (hsa_bb *hbb)
+{
+  dump_hsa_bb (stderr, hbb);
+}
+
+/* Dump textual representation of HSA IL of the current function to stderr.  */
+
+DEBUG_FUNCTION void
+debug_hsa_cfun (void)
+{
+  dump_hsa_cfun (stderr);
+}
+
+/* Dump textual representation of an HSA operand to stderr.  */
+
+DEBUG_FUNCTION void
+debug_hsa_operand (hsa_op_base *opc)
+{
+  dump_hsa_operand (stderr, opc, true);
+  fprintf (stderr, "\n");
+}
+
+/* Dump textual representation of an HSA symbol to stderr.  */
+
+DEBUG_FUNCTION void
+debug_hsa_symbol (hsa_symbol *symbol)
+{
+  dump_hsa_symbol (stderr, symbol);
+  fprintf (stderr, "\n");
+}
diff --git a/gcc/hsa-gen.c b/gcc/hsa-gen.c
new file mode 100644
index 0000000..600b2ca
--- /dev/null
+++ b/gcc/hsa-gen.c
@@ -0,0 +1,5515 @@
+/* A pass for lowering gimple to HSAIL
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Martin Jambor <mjambor@suse.cz> and
+   Martin Liska <mliska@suse.cz>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "hash-table.h"
+#include "vec.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "cfg.h"
+#include "function.h"
+#include "basic-block.h"
+#include "fold-const.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "bitmap.h"
+#include "dumpfile.h"
+#include "gimple-pretty-print.h"
+#include "diagnostic-core.h"
+#include "alloc-pool.h"
+#include "gimple-ssa.h"
+#include "tree-phinodes.h"
+#include "stringpool.h"
+#include "tree-ssanames.h"
+#include "tree-dfa.h"
+#include "ssa-iterators.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+#include "cfghooks.h"
+#include "tree-cfg.h"
+#include "cfgloop.h"
+#include "cfganal.h"
+#include "builtins.h"
+#include "params.h"
+#include "gomp-constants.h"
+
+/* Print a warning message and set that we have seen an error.  */
+
+#define HSA_SORRY_ATV(location, message, ...) \
+  do \
+  { \
+    hsa_fail_cfun (); \
+    if (warning_at (EXPR_LOCATION (hsa_cfun->m_decl), OPT_Whsa, \
+		    HSA_SORRY_MSG)) \
+      inform (location, message, __VA_ARGS__); \
+  } \
+  while (false)
+
+/* Same as previous, but highlight a location.  */
+
+#define HSA_SORRY_AT(location, message) \
+  do \
+  { \
+    hsa_fail_cfun (); \
+    if (warning_at (EXPR_LOCATION (hsa_cfun->m_decl), OPT_Whsa, \
+		    HSA_SORRY_MSG)) \
+      inform (location, message); \
+  } \
+  while (false)
+
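+/* For example (the message is illustrative):
+
+     HSA_SORRY_AT (gimple_location (stmt),
+		   "support for HSA does not implement something");
+
+   Both macros call hsa_fail_cfun first, so the whole current function is
+   marked as not translatable to HSAIL.  */
+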
+/* Default number of threads used by kernel dispatch.  */
+
+#define HSA_DEFAULT_NUM_THREADS 64
+
+/* The following structures are defined in the final version
+   of the HSA specification.  */
+
+/* The HSA queue packet is a shadow structure, originally provided by AMD.  */
+
+struct hsa_queue_packet
+{
+  uint16_t header;
+  uint16_t setup;
+  uint16_t workgroup_size_x;
+  uint16_t workgroup_size_y;
+  uint16_t workgroup_size_z;
+  uint16_t reserved0;
+  uint32_t grid_size_x;
+  uint32_t grid_size_y;
+  uint32_t grid_size_z;
+  uint32_t private_segment_size;
+  uint32_t group_segment_size;
+  uint64_t kernel_object;
+  void *kernarg_address;
+  uint64_t reserved2;
+  uint64_t completion_signal;
+};
+
+/* The HSA queue is a shadow structure, originally provided by AMD.  */
+
+struct hsa_queue
+{
+  int type;
+  uint32_t features;
+  void *base_address;
+  uint64_t doorbell_signal;
+  uint32_t size;
+  uint32_t reserved1;
+  uint64_t id;
+};
+
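+/* These shadow structures appear to correspond to the run-time types
+   hsa_kernel_dispatch_packet_t and hsa_queue_t; they are spelled out here so
+   that code generation does not depend on the HSA run-time headers.  */
+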
+/* Alloc pools for allocating basic HSA structures such as operands,
+   instructions and other basic entities.  */
+static object_allocator<hsa_op_address> *hsa_allocp_operand_address;
+static object_allocator<hsa_op_immed> *hsa_allocp_operand_immed;
+static object_allocator<hsa_op_reg> *hsa_allocp_operand_reg;
+static object_allocator<hsa_op_code_list> *hsa_allocp_operand_code_list;
+static object_allocator<hsa_op_operand_list> *hsa_allocp_operand_operand_list;
+static object_allocator<hsa_insn_basic> *hsa_allocp_inst_basic;
+static object_allocator<hsa_insn_phi> *hsa_allocp_inst_phi;
+static object_allocator<hsa_insn_mem> *hsa_allocp_inst_mem;
+static object_allocator<hsa_insn_atomic> *hsa_allocp_inst_atomic;
+static object_allocator<hsa_insn_signal> *hsa_allocp_inst_signal;
+static object_allocator<hsa_insn_seg> *hsa_allocp_inst_seg;
+static object_allocator<hsa_insn_cmp> *hsa_allocp_inst_cmp;
+static object_allocator<hsa_insn_br> *hsa_allocp_inst_br;
+static object_allocator<hsa_insn_sbr> *hsa_allocp_inst_sbr;
+static object_allocator<hsa_insn_call> *hsa_allocp_inst_call;
+static object_allocator<hsa_insn_arg_block> *hsa_allocp_inst_arg_block;
+static object_allocator<hsa_insn_comment> *hsa_allocp_inst_comment;
+static object_allocator<hsa_insn_queue> *hsa_allocp_inst_queue;
+static object_allocator<hsa_insn_packed> *hsa_allocp_inst_packed;
+static object_allocator<hsa_insn_cvt> *hsa_allocp_inst_cvt;
+static object_allocator<hsa_bb> *hsa_allocp_bb;
+
+/* List of pointers to all instructions that come from an object allocator.  */
+static vec <hsa_insn_basic *> hsa_instructions;
+
+/* List of pointers to all operands that come from an object allocator.  */
+static vec <hsa_op_base *> hsa_operands;
+
+hsa_symbol::hsa_symbol ()
+: m_decl (NULL_TREE), m_name (NULL), m_name_number (0),
+  m_directive_offset (0), m_type (BRIG_TYPE_NONE),
+  m_segment (BRIG_SEGMENT_NONE), m_linkage (BRIG_LINKAGE_NONE), m_dim (0),
+  m_cst_value (NULL), m_global_scope_p (false), m_seen_error (false)
+{
+}
+
+
+hsa_symbol::hsa_symbol (BrigType16_t type, BrigSegment8_t segment,
+			BrigLinkage8_t linkage)
+: m_decl (NULL_TREE), m_name (NULL), m_name_number (0),
+  m_directive_offset (0), m_type (type), m_segment (segment),
+  m_linkage (linkage), m_dim (0), m_cst_value (NULL), m_global_scope_p (false),
+  m_seen_error (false)
+{
+}
+
+unsigned HOST_WIDE_INT
+hsa_symbol::total_byte_size ()
+{
+  unsigned HOST_WIDE_INT s = hsa_type_bit_size (~BRIG_TYPE_ARRAY_MASK & m_type);
+  gcc_assert (s % BITS_PER_UNIT == 0);
+  s /= BITS_PER_UNIT;
+
+  if (m_dim)
+    s *= m_dim;
+
+  return s;
+}
+
+/* Forward declaration.  */
+
+static BrigType16_t
+hsa_type_for_tree_type (const_tree type, unsigned HOST_WIDE_INT *dim_p,
+			bool min32int);
+
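+/* Set up the symbol to represent DECL, deriving its HSA type (and array
+   dimension) from the tree type of DECL.  */
+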
+void
+hsa_symbol::fillup_for_decl (tree decl)
+{
+  m_decl = decl;
+  m_type = hsa_type_for_tree_type (TREE_TYPE (decl), &m_dim, false);
+
+  if (hsa_seen_error ())
+    m_seen_error = true;
+}
+
+/* Constructor of class representing global HSA function/kernel information
+   and state.  FDECL is the function declaration, KERNEL_P is true if the
+   function is going to become an HSA kernel.  If the function has a body,
+   SSA_NAMES_COUNT should be set to the number of SSA names used in the
+   function.  */
+
+hsa_function_representation::hsa_function_representation
+  (tree fdecl, bool kernel_p, unsigned ssa_names_count): m_name (NULL),
+  m_reg_count (0), m_input_args (vNULL),
+  m_output_arg (NULL), m_spill_symbols (vNULL), m_readonly_variables (vNULL),
+  m_private_variables (vNULL), m_called_functions (vNULL), m_hbb_count (0),
+  m_in_ssa (true), m_kern_p (kernel_p), m_declaration_p (false), m_decl (fdecl),
+  m_shadow_reg (NULL), m_kernel_dispatch_count (0), m_maximum_omp_data_size (0),
+  m_seen_error (false), m_temp_symbol_count (0), m_ssa_map ()
+{
+  int sym_init_len = (vec_safe_length (cfun->local_decls) / 2) + 1;
+  m_local_symbols = new hash_table <hsa_noop_symbol_hasher> (sym_init_len);
+  m_ssa_map.safe_grow_cleared (ssa_names_count);
+}
+
+/* Destructor of class holding function/kernel-wide information and state.  */
+
+hsa_function_representation::~hsa_function_representation ()
+{
+  /* Kernel names are deallocated at the end of BRIG output when deallocating
+     hsa_decl_kernel_mapping.  */
+  if (!m_kern_p || m_seen_error)
+    free (m_name);
+
+  for (unsigned i = 0; i < m_input_args.length (); i++)
+    delete m_input_args[i];
+  m_input_args.release ();
+
+  delete m_output_arg;
+  delete m_local_symbols;
+
+  for (unsigned i = 0; i < m_spill_symbols.length (); i++)
+    delete m_spill_symbols[i];
+  m_spill_symbols.release ();
+
+  for (unsigned i = 0; i < m_readonly_variables.length (); i++)
+    delete m_readonly_variables[i];
+  m_readonly_variables.release ();
+
+  for (unsigned i = 0; i < m_private_variables.length (); i++)
+    delete m_private_variables[i];
+  m_private_variables.release ();
+  m_called_functions.release ();
+  m_ssa_map.release ();
+}
+
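+/* Return the register holding the kernel dispatch shadow argument, creating
+   the kernarg symbol and the load instruction upon first use.  Return NULL
+   for functions that are not kernels.  */
+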
+hsa_op_reg *
+hsa_function_representation::get_shadow_reg ()
+{
+  /* If we compile a function that uses kernel dispatch but no optimization
+     level is set, the function is not inlined into a kernel, so we
+     return NULL.  */
+  if (!m_kern_p)
+    return NULL;
+
+  if (m_shadow_reg)
+    return m_shadow_reg;
+
+  /* Append the shadow argument.  */
+  hsa_symbol *shadow = new hsa_symbol (BRIG_TYPE_U64, BRIG_SEGMENT_KERNARG,
+				       BRIG_LINKAGE_FUNCTION);
+  m_input_args.safe_push (shadow);
+  shadow->m_name = "hsa_runtime_shadow";
+
+  hsa_op_reg *r = new hsa_op_reg (BRIG_TYPE_U64);
+  hsa_op_address *addr = new hsa_op_address (shadow);
+
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, r, addr);
+  hsa_bb_for_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun))->append_insn (mem);
+  m_shadow_reg = r;
+
+  return r;
+}
+
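+/* Return true if the shadow register has already been created for this
+   function.  */
+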
+bool
+hsa_function_representation::has_shadow_reg_p ()
+{
+  return m_shadow_reg != NULL;
+}
+
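+/* Create the HSA basic blocks corresponding to the artificial entry and exit
+   blocks of the current function.  */
+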
+void
+hsa_function_representation::init_extra_bbs ()
+{
+  hsa_init_new_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+  hsa_init_new_bb (EXIT_BLOCK_PTR_FOR_FN (cfun));
+}
+
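+/* Create a new private temporary symbol of TYPE and register it with the
+   current function.  */
+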
+hsa_symbol *
+hsa_function_representation::create_hsa_temporary (BrigType16_t type)
+{
+  hsa_symbol *s = new hsa_symbol (type, BRIG_SEGMENT_PRIVATE,
+				  BRIG_LINKAGE_FUNCTION);
+  s->m_name_number = m_temp_symbol_count++;
+
+  hsa_cfun->m_private_variables.safe_push (s);
+  return s;
+}
+
+/* Allocate HSA structures that we need only while generating code for the
+   current function.  */
+
+static void
+hsa_init_data_for_cfun ()
+{
+  hsa_init_compilation_unit_data ();
+  hsa_allocp_operand_address
+    = new object_allocator<hsa_op_address> ("HSA address operands");
+  hsa_allocp_operand_immed
+    = new object_allocator<hsa_op_immed> ("HSA immediate operands");
+  hsa_allocp_operand_reg
+    = new object_allocator<hsa_op_reg> ("HSA register operands");
+  hsa_allocp_operand_code_list
+    = new object_allocator<hsa_op_code_list> ("HSA code list operands");
+  hsa_allocp_operand_operand_list
+    = new object_allocator<hsa_op_operand_list> ("HSA operand list operands");
+  hsa_allocp_inst_basic
+    = new object_allocator<hsa_insn_basic> ("HSA basic instructions");
+  hsa_allocp_inst_phi
+    = new object_allocator<hsa_insn_phi> ("HSA phi operands");
+  hsa_allocp_inst_mem
+    = new object_allocator<hsa_insn_mem> ("HSA memory instructions");
+  hsa_allocp_inst_atomic
+    = new object_allocator<hsa_insn_atomic> ("HSA atomic instructions");
+  hsa_allocp_inst_signal
+    = new object_allocator<hsa_insn_signal> ("HSA signal instructions");
+  hsa_allocp_inst_seg
+    = new object_allocator<hsa_insn_seg> ("HSA segment conversion instructions");
+  hsa_allocp_inst_cmp
+    = new object_allocator<hsa_insn_cmp> ("HSA comparison instructions");
+  hsa_allocp_inst_br
+    = new object_allocator<hsa_insn_br> ("HSA branching instructions");
+  hsa_allocp_inst_sbr
+    = new object_allocator<hsa_insn_sbr> ("HSA switch branching instructions");
+  hsa_allocp_inst_call
+    = new object_allocator<hsa_insn_call> ("HSA call instructions");
+  hsa_allocp_inst_arg_block
+    = new object_allocator<hsa_insn_arg_block> ("HSA arg block instructions");
+  hsa_allocp_inst_comment
+    = new object_allocator<hsa_insn_comment> ("HSA comment instructions");
+  hsa_allocp_inst_queue
+    = new object_allocator<hsa_insn_queue> ("HSA queue instructions");
+  hsa_allocp_inst_packed
+    = new object_allocator<hsa_insn_packed> ("HSA packed instructions");
+  hsa_allocp_inst_cvt
+    = new object_allocator<hsa_insn_cvt> ("HSA convert instructions");
+  hsa_allocp_bb = new object_allocator<hsa_bb> ("HSA basic blocks");
+}
+
+/* Deinitialize HSA subsystem and free all allocated memory.  */
+
+static void
+hsa_deinit_data_for_cfun (void)
+{
+  basic_block bb;
+
+  FOR_ALL_BB_FN (bb, cfun)
+    if (bb->aux)
+      {
+	hsa_bb *hbb = hsa_bb_for_bb (bb);
+	hbb->~hsa_bb ();
+	bb->aux = NULL;
+      }
+
+  for (unsigned int i = 0; i < hsa_operands.length (); i++)
+    hsa_destroy_operand (hsa_operands[i]);
+
+  hsa_operands.release ();
+
+  for (unsigned i = 0; i < hsa_instructions.length (); i++)
+    hsa_destroy_insn (hsa_instructions[i]);
+
+  hsa_instructions.release ();
+
+  delete hsa_allocp_operand_address;
+  delete hsa_allocp_operand_immed;
+  delete hsa_allocp_operand_reg;
+  delete hsa_allocp_operand_code_list;
+  delete hsa_allocp_operand_operand_list;
+  delete hsa_allocp_inst_basic;
+  delete hsa_allocp_inst_phi;
+  delete hsa_allocp_inst_atomic;
+  delete hsa_allocp_inst_mem;
+  delete hsa_allocp_inst_signal;
+  delete hsa_allocp_inst_seg;
+  delete hsa_allocp_inst_cmp;
+  delete hsa_allocp_inst_br;
+  delete hsa_allocp_inst_sbr;
+  delete hsa_allocp_inst_call;
+  delete hsa_allocp_inst_arg_block;
+  delete hsa_allocp_inst_comment;
+  delete hsa_allocp_inst_queue;
+  delete hsa_allocp_inst_packed;
+  delete hsa_allocp_inst_cvt;
+  delete hsa_allocp_bb;
+  delete hsa_cfun;
+}
+
+/* Return the type which holds addresses in the given SEGMENT.  */
+
+static BrigType16_t
+hsa_get_segment_addr_type (BrigSegment8_t segment)
+{
+  switch (segment)
+    {
+    case BRIG_SEGMENT_NONE:
+      gcc_unreachable ();
+
+    case BRIG_SEGMENT_FLAT:
+    case BRIG_SEGMENT_GLOBAL:
+    case BRIG_SEGMENT_READONLY:
+    case BRIG_SEGMENT_KERNARG:
+      return hsa_machine_large_p () ? BRIG_TYPE_U64 : BRIG_TYPE_U32;
+
+    case BRIG_SEGMENT_GROUP:
+    case BRIG_SEGMENT_PRIVATE:
+    case BRIG_SEGMENT_SPILL:
+    case BRIG_SEGMENT_ARG:
+      return BRIG_TYPE_U32;
+    }
+  gcc_unreachable ();
+}
+
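+/* In the large machine model this means that flat, global, readonly and
+   kernarg addresses are 64-bit while group, private, spill and arg addresses
+   are always 32-bit.  */
+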
+/* Return integer brig type according to provided SIZE in bytes.  If SIGN
+   is set to true, return signed integer type.  */
+
+static BrigType16_t
+get_integer_type_by_bytes (unsigned size, bool sign)
+{
+  if (sign)
+    switch (size)
+      {
+      case 1:
+	return BRIG_TYPE_S8;
+      case 2:
+	return BRIG_TYPE_S16;
+      case 4:
+	return BRIG_TYPE_S32;
+      case 8:
+	return BRIG_TYPE_S64;
+      default:
+	break;
+      }
+  else
+    switch (size)
+      {
+      case 1:
+	return BRIG_TYPE_U8;
+      case 2:
+	return BRIG_TYPE_U16;
+      case 4:
+	return BRIG_TYPE_U32;
+      case 8:
+	return BRIG_TYPE_U64;
+      default:
+	break;
+      }
+
+  return 0;
+}
+
+/* Return HSA type for tree TYPE, which has to fit into BrigType16_t.  Pointers
+   are assumed to use flat addressing.  If MIN32INT is true, always expand
+   integer types to one that has at least 32 bits.  */
+
+static BrigType16_t
+hsa_type_for_scalar_tree_type (const_tree type, bool min32int)
+{
+  HOST_WIDE_INT bsize;
+  const_tree base;
+  BrigType16_t res = BRIG_TYPE_NONE;
+
+  gcc_checking_assert (TYPE_P (type));
+  gcc_checking_assert (!AGGREGATE_TYPE_P (type));
+  if (POINTER_TYPE_P (type))
+    return hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT);
+
+  if (TREE_CODE (type) == VECTOR_TYPE || TREE_CODE (type) == COMPLEX_TYPE)
+    base = TREE_TYPE (type);
+  else
+    base = type;
+
+  if (!tree_fits_uhwi_p (TYPE_SIZE (base)))
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (type),
+		     "support for HSA does not implement huge or "
+		     "variable-sized type %T", type);
+      return res;
+    }
+
+  bsize = tree_to_uhwi (TYPE_SIZE (base));
+  unsigned byte_size = bsize / BITS_PER_UNIT;
+  if (INTEGRAL_TYPE_P (base))
+    res = get_integer_type_by_bytes (byte_size, !TYPE_UNSIGNED (base));
+  else if (SCALAR_FLOAT_TYPE_P (base))
+    {
+      switch (bsize)
+	{
+	case 16:
+	  res = BRIG_TYPE_F16;
+	  break;
+	case 32:
+	  res = BRIG_TYPE_F32;
+	  break;
+	case 64:
+	  res = BRIG_TYPE_F64;
+	  break;
+	default:
+	  break;
+	}
+    }
+
+  if (res == BRIG_TYPE_NONE)
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (type),
+		     "support for HSA does not implement type %T", type);
+      return res;
+    }
+
+  if (TREE_CODE (type) == VECTOR_TYPE)
+    {
+      HOST_WIDE_INT tsize = tree_to_uhwi (TYPE_SIZE (type));
+
+      if (bsize == tsize)
+	{
+	  HSA_SORRY_ATV (EXPR_LOCATION (type),
+			 "support for HSA does not implement a vector type "
+			 "where a type and unit size are equal: %T", type);
+	  return res;
+	}
+
+      switch (tsize)
+	{
+	case 32:
+	  res |= BRIG_TYPE_PACK_32;
+	  break;
+	case 64:
+	  res |= BRIG_TYPE_PACK_64;
+	  break;
+	case 128:
+	  res |= BRIG_TYPE_PACK_128;
+	  break;
+	default:
+	  HSA_SORRY_ATV (EXPR_LOCATION (type),
+			 "support for HSA does not implement type %T", type);
+	}
+    }
+
+  if (min32int)
+    {
+      /* Registers/immediate operands can only be 32bit or more except for
+         f16.  */
+      if (res == BRIG_TYPE_U8 || res == BRIG_TYPE_U16)
+	res = BRIG_TYPE_U32;
+      else if (res == BRIG_TYPE_S8 || res == BRIG_TYPE_S16)
+	res = BRIG_TYPE_S32;
+    }
+
+  if (TREE_CODE (type) == COMPLEX_TYPE)
+    {
+      unsigned bsize = 2 * hsa_type_bit_size (res);
+      res = hsa_bittype_for_bitsize (bsize);
+    }
+
+  return res;
+}
+
+/* Returns the BRIG type we need to load/store entities of TYPE.  */
+
+static BrigType16_t
+mem_type_for_type (BrigType16_t type)
+{
+  /* HSA has non-intuitive constraints on load/store types.  If it's
+     a bit-type it _must_ be B128, if it's not a bit-type it must be
+     64bit max.  So for loading entities of 128 bits (e.g. vectors)
+     we have to use B128, while for loading the rest we have to use the
+     input type (??? or maybe also flatten it to an equally sized non-vector
+     unsigned type?).  */
+  if ((type & BRIG_TYPE_PACK_MASK) == BRIG_TYPE_PACK_128)
+    return BRIG_TYPE_B128;
+  else if (hsa_btype_p (type))
+    {
+      unsigned bitsize = hsa_type_bit_size (type);
+      if (bitsize < 128)
+	return hsa_uint_for_bitsize (bitsize);
+    }
+  return type;
+}
+
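+/* For example, a 128-bit packed type such as f32x4 is loaded and stored as
+   b128, a b64 bit-type as u64, and an s32 simply stays s32.  */
+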
+/* Return HSA type for tree TYPE.  If it cannot fit into BrigType16_t, some
+   kind of array will be generated, setting DIM appropriately.  Otherwise, it
+   will be set to zero.  */
+
+static BrigType16_t
+hsa_type_for_tree_type (const_tree type, unsigned HOST_WIDE_INT *dim_p = NULL,
+			bool min32int = false)
+{
+  gcc_checking_assert (TYPE_P (type));
+  if (!tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)))
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (type), "support for HSA does not "
+		     "implement huge or variable-sized type %T", type);
+      return BRIG_TYPE_NONE;
+    }
+
+  if (RECORD_OR_UNION_TYPE_P (type))
+    {
+      if (dim_p)
+	*dim_p = tree_to_uhwi (TYPE_SIZE_UNIT (type));
+      return BRIG_TYPE_U8 | BRIG_TYPE_ARRAY;
+    }
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+    {
+      /* We try to be nice and use the real base-type when this is an array of
+	 scalars and only resort to an array of bytes if the type is more
+	 complex.  */
+
+      unsigned HOST_WIDE_INT dim = 1;
+
+      while (TREE_CODE (type) == ARRAY_TYPE)
+	{
+	  tree domain = TYPE_DOMAIN (type);
+	  if (!TYPE_MIN_VALUE (domain)
+	      || !TYPE_MAX_VALUE (domain)
+	      || !tree_fits_shwi_p (TYPE_MIN_VALUE (domain))
+	      || !tree_fits_shwi_p (TYPE_MAX_VALUE (domain)))
+	    {
+	      HSA_SORRY_ATV (EXPR_LOCATION (type),
+			     "support for HSA does not implement array %T with "
+			     "unknown bounds", type);
+	      return BRIG_TYPE_NONE;
+	    }
+	  HOST_WIDE_INT min = tree_to_shwi (TYPE_MIN_VALUE (domain));
+	  HOST_WIDE_INT max = tree_to_shwi (TYPE_MAX_VALUE (domain));
+	  dim = dim * (unsigned HOST_WIDE_INT) (max - min + 1);
+	  type = TREE_TYPE (type);
+	}
+
+      BrigType16_t res;
+      if (RECORD_OR_UNION_TYPE_P (type))
+	{
+	  dim = dim * tree_to_uhwi (TYPE_SIZE_UNIT (type));
+	  res = BRIG_TYPE_U8;
+	}
+      else
+	res = hsa_type_for_scalar_tree_type (type, false);
+
+      if (dim_p)
+	*dim_p = dim;
+      return res | BRIG_TYPE_ARRAY;
+    }
+
+  /* Scalar case: */
+  if (dim_p)
+    *dim_p = 0;
+
+  return hsa_type_for_scalar_tree_type (type, min32int);
+}
+
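+/* For instance, "int a[2][3]" is described as an s32 array with *DIM_P set
+   to 6, whereas an array of four 12-byte structures becomes a u8 array with
+   *DIM_P set to 48.  */
+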
+/* Returns true if converting from STYPE into DTYPE needs the _CVT
+   opcode.  If false a normal _MOV is enough.  */
+
+static bool
+hsa_needs_cvt (BrigType16_t dtype, BrigType16_t stype)
+{
+  if (hsa_btype_p (dtype))
+    return false;
+
+  /* float <-> int conversions are real converts.  */
+  if (hsa_type_float_p (dtype) != hsa_type_float_p (stype))
+    return true;
+  /* When both types have different size, then we need CVT as well.  */
+  if (hsa_type_bit_size (dtype) != hsa_type_bit_size (stype))
+    return true;
+  return false;
+}
+
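+/* For example, f32 -> s32 and u32 -> u64 need a cvt, whereas u32 -> s32 or
+   any conversion to a b-type can be done with a plain mov.  */
+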
+/* Lookup or create the associated hsa_symbol structure with a given VAR_DECL
+   or lookup the hsa_structure corresponding to a PARM_DECL.  */
+
+static hsa_symbol *
+get_symbol_for_decl (tree decl)
+{
+  hsa_symbol **slot, *sym;
+  hsa_symbol dummy (BRIG_TYPE_NONE, BRIG_SEGMENT_NONE, BRIG_LINKAGE_NONE);
+
+  gcc_assert (TREE_CODE (decl) == PARM_DECL
+	      || TREE_CODE (decl) == RESULT_DECL
+	      || TREE_CODE (decl) == VAR_DECL);
+
+  dummy.m_decl = decl;
+
+  slot = hsa_cfun->m_local_symbols->find_slot (&dummy, INSERT);
+  gcc_checking_assert (slot);
+  if (*slot)
+    {
+      sym = *slot;
+
+      /* If the symbol is problematic, mark current function also as
+	 problematic.  */
+      if (sym->m_seen_error)
+	hsa_fail_cfun ();
+
+      return sym;
+    }
+
+  if (TREE_CODE (decl) == VAR_DECL && is_global_var (decl))
+    {
+      sym = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_READONLY,
+			    BRIG_LINKAGE_MODULE);
+
+      /* The following types of global variables can be handled.  */
+      if (TREE_READONLY (decl) && !TREE_ADDRESSABLE (decl)
+	  && DECL_INITIAL (decl) && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE
+	  && TREE_CODE (TREE_TYPE (TREE_TYPE (decl))) == INTEGER_TYPE)
+	{
+	  sym->m_cst_value = new hsa_op_immed (DECL_INITIAL (decl), false);
+	}
+      else
+	HSA_SORRY_ATV (EXPR_LOCATION (decl), "referring to global symbol "
+		       "%q+D by name from HSA code won't work", decl);
+
+      hsa_cfun->m_readonly_variables.safe_push (sym);
+    }
+  else
+    {
+      gcc_assert (TREE_CODE (decl) == VAR_DECL);
+      sym = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_PRIVATE,
+			    BRIG_LINKAGE_FUNCTION);
+      hsa_cfun->m_private_variables.safe_push (sym);
+    }
+
+  sym->fillup_for_decl (decl);
+  sym->m_name = hsa_get_declaration_name (decl);
+  *slot = sym;
+  return sym;
+}
+
+/* For a given HSA function declaration, return a host
+   function declaration.  */
+
+tree
+hsa_get_host_function (tree decl)
+{
+  hsa_function_summary *s = hsa_summaries->get (cgraph_node::get_create (decl));
+  gcc_assert (s->m_kind != HSA_NONE);
+  gcc_assert (s->m_gpu_implementation_p);
+
+  return s->m_binded_function->decl;
+}
+
+/* Return the name function DECL should have in BRIG, using the name of the
+   host function when DECL is its GPU implementation.  */
+
+static char *
+get_brig_function_name (tree decl)
+{
+  tree d = decl;
+
+  hsa_function_summary *s = hsa_summaries->get (cgraph_node::get_create (d));
+  if (s->m_kind != HSA_NONE && s->m_gpu_implementation_p)
+    d = s->m_binded_function->decl;
+
+  /* IPA split can create a function that has no host equivalent.  */
+  if (d == NULL)
+    d = decl;
+
+  char *name = xstrdup (hsa_get_declaration_name (d));
+  hsa_sanitize_name (name);
+
+  return name;
+}
+
+/* Create a spill symbol of type TYPE.  */
+
+hsa_symbol *
+hsa_get_spill_symbol (BrigType16_t type)
+{
+  hsa_symbol *sym = new hsa_symbol (type, BRIG_SEGMENT_SPILL,
+				    BRIG_LINKAGE_FUNCTION);
+  hsa_cfun->m_spill_symbols.safe_push (sym);
+  return sym;
+}
+
+/* Create a symbol for a read-only string constant.  */
+
+hsa_symbol *
+hsa_get_string_cst_symbol (tree string_cst)
+{
+  gcc_checking_assert (TREE_CODE (string_cst) == STRING_CST);
+
+  hsa_symbol **slot = hsa_cfun->m_string_constants_map.get (string_cst);
+  if (slot)
+    return *slot;
+
+  hsa_op_immed *cst = new hsa_op_immed (string_cst);
+  hsa_symbol *sym = new hsa_symbol (cst->m_type,
+				    BRIG_SEGMENT_GLOBAL, BRIG_LINKAGE_MODULE);
+  sym->m_cst_value = cst;
+  sym->m_dim = TREE_STRING_LENGTH (string_cst);
+  sym->m_name_number = hsa_cfun->m_readonly_variables.length ();
+  sym->m_global_scope_p = true;
+
+  hsa_cfun->m_readonly_variables.safe_push (sym);
+  hsa_cfun->m_string_constants_map.put (string_cst, sym);
+  return sym;
+}
+
+/* Constructor of the ancestor of all operands.  K is the BRIG kind that
+   identifies what the operand is.  */
+
+hsa_op_base::hsa_op_base (BrigKind16_t k): m_next (NULL), m_brig_op_offset (0),
+  m_kind (k)
+{
+  hsa_operands.safe_push (this);
+}
+
+/* Constructor of the ancestor of all operands which have a type.  K is the
+   BRIG kind that identifies what the operand is.  T is the type of the
+   operand.  */
+
+hsa_op_with_type::hsa_op_with_type (BrigKind16_t k, BrigType16_t t)
+  : hsa_op_base (k), m_type (t)
+{
+}
+
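+/* If this operand already has type DTYPE, return it.  Otherwise return a new
+   register operand with type DTYPE, initialized from this operand by a CVT or
+   MOV instruction appended to HBB.  */
+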
+hsa_op_with_type *
+hsa_op_with_type::get_in_type (BrigType16_t dtype, hsa_bb *hbb)
+{
+  if (m_type == dtype)
+    return this;
+
+  hsa_op_reg *dest;
+
+  if (hsa_needs_cvt (dtype, m_type))
+    {
+      dest = new hsa_op_reg (dtype);
+      hbb->append_insn (new hsa_insn_cvt (dest, this));
+    }
+  else
+    {
+      dest = new hsa_op_reg (m_type);
+      hbb->append_insn (new hsa_insn_basic (2, BRIG_OPCODE_MOV,
+					    dest->m_type, dest, this));
+
+      /* We cannot simply emit, for instance, 'mov_u32 $_3, 48 (s32)' because
+	 the type of the operand must be the same as the type of the
+	 instruction.  */
+      dest->m_type = dtype;
+    }
+
+  return dest;
+}
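+
+/* A typical use of get_in_type is widening an index register before address
+   arithmetic, e.g.
+
+     hsa_cfun->reg_for_gimple_ssa (ssa)->get_in_type (addrtype, hbb)
+
+   which yields either the original register or a fresh one defined by the
+   appended cvt or mov instruction.  */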
+
+/* Constructor of class representing HSA immediate values.  TREE_VAL is the
+   tree representation of the immediate value.  If MIN32INT is true,
+   always expand integer types to one that has at least 32 bits.  */
+
+hsa_op_immed::hsa_op_immed (tree tree_val, bool min32int)
+  : hsa_op_with_type (BRIG_KIND_OPERAND_CONSTANT_BYTES,
+		      hsa_type_for_tree_type (TREE_TYPE (tree_val), NULL,
+					      min32int)),
+  m_brig_repr (NULL)
+{
+  if (hsa_seen_error ())
+    return;
+
+  gcc_checking_assert ((is_gimple_min_invariant (tree_val)
+		       && (!POINTER_TYPE_P (TREE_TYPE (tree_val))
+			   || TREE_CODE (tree_val) == INTEGER_CST))
+		       || TREE_CODE (tree_val) == CONSTRUCTOR);
+  m_tree_value = tree_val;
+  m_brig_repr_size = hsa_get_imm_brig_type_len (m_type);
+
+  if (TREE_CODE (m_tree_value) == STRING_CST)
+    m_brig_repr_size = TREE_STRING_LENGTH (m_tree_value);
+  else if (TREE_CODE (m_tree_value) == CONSTRUCTOR)
+    {
+      m_brig_repr_size = tree_to_uhwi
+	(TYPE_SIZE_UNIT (TREE_TYPE (m_tree_value)));
+
+      /* Verify that all elements of a constructor are constants.  */
+      for (unsigned i = 0;
+	   i < vec_safe_length (CONSTRUCTOR_ELTS (m_tree_value)); i++)
+	{
+	  tree v = CONSTRUCTOR_ELT (m_tree_value, i)->value;
+	  if (!CONSTANT_CLASS_P (v))
+	    {
+	      HSA_SORRY_AT (EXPR_LOCATION (tree_val),
+			    "HSA ctor should have only constants");
+	      return;
+	    }
+	}
+    }
+
+  emit_to_buffer (m_tree_value);
+}
+
+/* Constructor of class representing HSA immediate values.  INTEGER_VALUE is the
+   integer representation of the immediate value.  TYPE is its BRIG type.  */
+
+hsa_op_immed::hsa_op_immed (HOST_WIDE_INT integer_value, BrigType16_t type)
+  : hsa_op_with_type (BRIG_KIND_OPERAND_CONSTANT_BYTES, type),
+  m_tree_value (NULL), m_brig_repr (NULL)
+{
+  gcc_assert (hsa_type_integer_p (type));
+  m_int_value = integer_value;
+  m_brig_repr_size = hsa_type_bit_size (type) / BITS_PER_UNIT;
+
+  hsa_bytes bytes;
+
+  switch (m_brig_repr_size)
+    {
+    case 1:
+      bytes.b8 = (uint8_t) m_int_value;
+      break;
+    case 2:
+      bytes.b16 = (uint16_t) m_int_value;
+      break;
+    case 4:
+      bytes.b32 = (uint32_t) m_int_value;
+      break;
+    case 8:
+      bytes.b64 = (uint64_t) m_int_value;
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  m_brig_repr = XNEWVEC (char, m_brig_repr_size);
+  memcpy (m_brig_repr, &bytes, m_brig_repr_size);
+}
+
+hsa_op_immed::hsa_op_immed ():
+  hsa_op_with_type (BRIG_KIND_NONE, BRIG_TYPE_NONE), m_brig_repr (NULL)
+{
+}
+
+/* New operator to allocate immediate operands from pool alloc.  */
+
+void *
+hsa_op_immed::operator new (size_t)
+{
+  return hsa_allocp_operand_immed->vallocate ();
+}
+
+/* Destructor.  */
+
+hsa_op_immed::~hsa_op_immed ()
+{
+  free (m_brig_repr);
+}
+
+/* Change type of the immediate value to T.  */
+
+void
+hsa_op_immed::set_type (BrigType16_t t)
+{
+  m_type = t;
+}
+
+/* Constructor of class representing HSA registers and pseudo-registers.  T is
+   the BRIG type of the new register.  */
+
+hsa_op_reg::hsa_op_reg (BrigType16_t t)
+  : hsa_op_with_type (BRIG_KIND_OPERAND_REGISTER, t), m_gimple_ssa (NULL_TREE),
+  m_def_insn (NULL), m_spill_sym (NULL), m_order (hsa_cfun->m_reg_count++),
+  m_lr_begin (0), m_lr_end (0), m_reg_class (0), m_hard_num (0)
+{
+}
+
+/* New operator to allocate a register from pool alloc.  */
+
+void *
+hsa_op_reg::operator new (size_t)
+{
+  return hsa_allocp_operand_reg->vallocate ();
+}
+
+/* Verify register operand.  */
+
+void
+hsa_op_reg::verify_ssa ()
+{
+  /* Verify that each HSA register has a definition assigned.
+     The exception is a default definition of an SSA name that does
+     not correspond to a PARM_DECL.  */
+  gcc_checking_assert (m_def_insn
+		       || (m_gimple_ssa != NULL
+			   && (!SSA_NAME_VAR (m_gimple_ssa)
+			       || (TREE_CODE (SSA_NAME_VAR (m_gimple_ssa))
+				   != PARM_DECL))
+			   && SSA_NAME_IS_DEFAULT_DEF (m_gimple_ssa)));
+
+  /* Verify that every use of the register is really present
+     in an instruction.  */
+  for (unsigned i = 0; i < m_uses.length (); i++)
+    {
+      hsa_insn_basic *use = m_uses[i];
+
+      bool is_visited = false;
+      for (unsigned j = 0; j < use->operand_count (); j++)
+	{
+	  hsa_op_base *u = use->get_op (j);
+	  hsa_op_address *addr = dyn_cast <hsa_op_address *> (u);
+	  if (addr && addr->m_reg)
+	    u = addr->m_reg;
+
+	  if (u == this)
+	    {
+	      bool r = !addr && use->op_output_p (j);
+
+	      if (r)
+		{
+		  error ("HSA SSA name defined by instruction that is supposed "
+			 "to be using it");
+		  debug_hsa_operand (this);
+		  debug_hsa_insn (use);
+		  internal_error ("HSA SSA verification failed");
+		}
+
+	      is_visited = true;
+	    }
+	}
+
+      if (!is_visited)
+	{
+	  error ("HSA SSA name not among operands of instruction that is "
+		 "supposed to use it");
+	  debug_hsa_operand (this);
+	  debug_hsa_insn (use);
+	  internal_error ("HSA SSA verification failed");
+	}
+    }
+}
+
+hsa_op_address::hsa_op_address (hsa_symbol *sym, hsa_op_reg *r,
+				HOST_WIDE_INT offset)
+  : hsa_op_base (BRIG_KIND_OPERAND_ADDRESS), m_symbol (sym), m_reg (r),
+  m_imm_offset (offset)
+{
+}
+
+hsa_op_address::hsa_op_address (hsa_symbol *sym, HOST_WIDE_INT offset)
+  : hsa_op_base (BRIG_KIND_OPERAND_ADDRESS), m_symbol (sym), m_reg (NULL),
+  m_imm_offset (offset)
+{
+}
+
+hsa_op_address::hsa_op_address (hsa_op_reg *r, HOST_WIDE_INT offset)
+  : hsa_op_base (BRIG_KIND_OPERAND_ADDRESS), m_symbol (NULL), m_reg (r),
+  m_imm_offset (offset)
+{
+}
+
+/* New operator to allocate address operands from pool alloc.  */
+
+void *
+hsa_op_address::operator new (size_t)
+{
+  return hsa_allocp_operand_address->vallocate ();
+}
+
+
+/* Constructor of an operand referring to HSAIL code.  */
+
+hsa_op_code_ref::hsa_op_code_ref () : hsa_op_base (BRIG_KIND_OPERAND_CODE_REF),
+  m_directive_offset (0)
+{
+}
+
+/* Constructor of an operand representing a code list.  Set it up so that it
+   can contain ELEMENTS number of elements.  */
+
+hsa_op_code_list::hsa_op_code_list (unsigned elements)
+  : hsa_op_base (BRIG_KIND_OPERAND_CODE_LIST)
+{
+  m_offsets.create (1);
+  m_offsets.safe_grow_cleared (elements);
+}
+
+/* New operator to allocate code list operands from pool alloc.  */
+
+void *
+hsa_op_code_list::operator new (size_t)
+{
+  return hsa_allocp_operand_code_list->vallocate ();
+}
+
+/* Constructor of an operand representing an operand list.
+   Set it up so that it can contain ELEMENTS number of elements.  */
+
+hsa_op_operand_list::hsa_op_operand_list (unsigned elements)
+  : hsa_op_base (BRIG_KIND_OPERAND_OPERAND_LIST)
+{
+  m_offsets.create (elements);
+  m_offsets.safe_grow (elements);
+}
+
+/* New operator to allocate operand list operands from pool alloc.  */
+
+void *
+hsa_op_operand_list::operator new (size_t)
+{
+  return hsa_allocp_operand_operand_list->vallocate ();
+}
+
+hsa_op_operand_list::~hsa_op_operand_list ()
+{
+  m_offsets.release ();
+}
+
+
+hsa_op_reg *
+hsa_function_representation::reg_for_gimple_ssa (tree ssa)
+{
+  hsa_op_reg *hreg;
+
+  gcc_checking_assert (TREE_CODE (ssa) == SSA_NAME);
+  if (m_ssa_map[SSA_NAME_VERSION (ssa)])
+    return m_ssa_map[SSA_NAME_VERSION (ssa)];
+
+  hreg = new hsa_op_reg (hsa_type_for_scalar_tree_type (TREE_TYPE (ssa),
+							 true));
+  hreg->m_gimple_ssa = ssa;
+  m_ssa_map[SSA_NAME_VERSION (ssa)] = hreg;
+
+  return hreg;
+}
+
+void
+hsa_op_reg::set_definition (hsa_insn_basic *insn)
+{
+  if (hsa_cfun->m_in_ssa)
+    {
+      gcc_checking_assert (!m_def_insn);
+      m_def_insn = insn;
+    }
+  else
+    m_def_insn = NULL;
+}
+
+/* Constructor of the class which is the base of all instructions and directly
+   represents the most basic ones.  NOPS is the number of operands that the
+   operand vector will contain (and which will be cleared).  OPC is the opcode
+   of the instruction.  This constructor does not set the type.  */
+
+hsa_insn_basic::hsa_insn_basic (unsigned nops, int opc): m_prev (NULL),
+  m_next (NULL), m_bb (NULL), m_opcode (opc), m_number (0),
+  m_type (BRIG_TYPE_NONE), m_brig_offset (0)
+{
+  if (nops > 0)
+    m_operands.safe_grow_cleared (nops);
+
+  hsa_instructions.safe_push (this);
+}
+
+/* Make OP the operand number INDEX of operands of this instruction.  If OP is
+   a register or an address containing a register, then either set the
+   definition of the register to this instruction if it is an output operand,
+   or add this instruction to the uses of the register if it is an input
+   one.  */
+
+void
+hsa_insn_basic::set_op (int index, hsa_op_base *op)
+{
+  /* An address operand is always a use.  */
+  hsa_op_address *addr = dyn_cast <hsa_op_address *> (op);
+  if (addr && addr->m_reg)
+    addr->m_reg->m_uses.safe_push (this);
+  else
+    {
+      hsa_op_reg *reg = dyn_cast <hsa_op_reg *> (op);
+      if (reg)
+	{
+	  if (op_output_p (index))
+	    reg->set_definition (this);
+	  else
+	    reg->m_uses.safe_push (this);
+	}
+    }
+
+  m_operands[index] = op;
+}
+
+/* Get INDEX-th operand of the instruction.  */
+
+hsa_op_base *
+hsa_insn_basic::get_op (int index)
+{
+  return m_operands[index];
+}
+
+/* Get address of INDEX-th operand of the instruction.  */
+
+hsa_op_base **
+hsa_insn_basic::get_op_addr (int index)
+{
+  return &m_operands[index];
+}
+
+/* Get the number of operands of the instruction.  */
+
+unsigned int
+hsa_insn_basic::operand_count ()
+{
+  return m_operands.length ();
+}
+
+/* Constructor of the class which is the base of all instructions and directly
+   represents the most basic ones.  NOPS is the number of operands that the
+   operand vector will contain (and which will be cleared).  OPC is the opcode
+   of the instruction and T is the type of the instruction.  */
+
+hsa_insn_basic::hsa_insn_basic (unsigned nops, int opc, BrigType16_t t,
+				hsa_op_base *arg0, hsa_op_base *arg1,
+				hsa_op_base *arg2, hsa_op_base *arg3):
+  m_prev (NULL), m_next (NULL), m_bb (NULL), m_opcode (opc), m_number (0),
+  m_type (t), m_brig_offset (0)
+{
+  if (nops > 0)
+    m_operands.safe_grow_cleared (nops);
+
+  if (arg0 != NULL)
+    {
+      gcc_checking_assert (nops >= 1);
+      set_op (0, arg0);
+    }
+
+  if (arg1 != NULL)
+    {
+      gcc_checking_assert (nops >= 2);
+      set_op (1, arg1);
+    }
+
+  if (arg2 != NULL)
+    {
+      gcc_checking_assert (nops >= 3);
+      set_op (2, arg2);
+    }
+
+  if (arg3 != NULL)
+    {
+      gcc_checking_assert (nops >= 4);
+      set_op (3, arg3);
+    }
+
+  hsa_instructions.safe_push (this);
+}
+
+/* New operator to allocate basic instruction from pool alloc.  */
+
+void *
+hsa_insn_basic::operator new (size_t)
+{
+  return hsa_allocp_inst_basic->vallocate ();
+}
+
+/* Verify the instruction.  */
+
+void
+hsa_insn_basic::verify ()
+{
+  hsa_op_address *addr;
+  hsa_op_reg *reg;
+
+  /* Iterate over all register operands and verify that the instruction
+     is present in the recorded uses of the register.  */
+  for (unsigned i = 0; i < operand_count (); i++)
+    {
+      hsa_op_base *use = get_op (i);
+
+      if ((addr = dyn_cast <hsa_op_address *> (use)) && addr->m_reg)
+	{
+	  gcc_assert (addr->m_reg->m_def_insn != this);
+	  use = addr->m_reg;
+	}
+
+      if ((reg = dyn_cast <hsa_op_reg *> (use)) && !op_output_p (i))
+	{
+	  unsigned j;
+	  for (j = 0; j < reg->m_uses.length (); j++)
+	    {
+	      if (reg->m_uses[j] == this)
+		break;
+	    }
+
+	  if (j == reg->m_uses.length ())
+	    {
+	      error ("HSA instruction uses a register but is not among "
+		     "recorded register uses");
+	      debug_hsa_operand (reg);
+	      debug_hsa_insn (this);
+	      internal_error ("HSA instruction verification failed");
+	    }
+	}
+    }
+}
+
+/* Constructor of an instruction representing a PHI node.  NOPS is the number
+   of operands (equal to the number of predecessors).  */
+
+hsa_insn_phi::hsa_insn_phi (unsigned nops, hsa_op_reg *dst)
+  : hsa_insn_basic (nops, HSA_OPCODE_PHI), m_dest (dst)
+{
+  dst->set_definition (this);
+}
+
+/* New operator to allocate PHI instruction from pool alloc.  */
+
+void *
+hsa_insn_phi::operator new (size_t)
+{
+  return hsa_allocp_inst_phi->vallocate ();
+}
+
+/* Constructor of class representing instruction for conditional jump.  CTRL
+   is the control register determining whether the jump will be carried out;
+   the new instruction is automatically added to its uses list.  */
+
+hsa_insn_br::hsa_insn_br (hsa_op_reg *ctrl)
+: hsa_insn_basic (1, BRIG_OPCODE_CBR, BRIG_TYPE_B1, ctrl),
+  m_width (BRIG_WIDTH_1)
+{
+}
+
+/* New operator to allocate branch instruction from pool alloc.  */
+
+void *
+hsa_insn_br::operator new (size_t)
+{
+  return hsa_allocp_inst_br->vallocate ();
+}
+
+/* Constructor of class representing instruction for switch jump.  INDEX is
+   the index register.  */
+
+hsa_insn_sbr::hsa_insn_sbr (hsa_op_reg *index, unsigned jump_count)
+: hsa_insn_basic (1, BRIG_OPCODE_SBR, BRIG_TYPE_B1, index),
+  m_width (BRIG_WIDTH_1), m_jump_table (vNULL), m_default_bb (NULL),
+  m_label_code_list (new hsa_op_code_list (jump_count))
+{
+}
+
+/* New operator to allocate switch branch instruction from pool alloc.  */
+
+void *
+hsa_insn_sbr::operator new (size_t)
+{
+  return hsa_allocp_inst_sbr->vallocate ();
+}
+
+/* Replace all occurrences of OLD_BB with NEW_BB in the statement's
+   jump table.  */
+
+void
+hsa_insn_sbr::replace_all_labels (basic_block old_bb, basic_block new_bb)
+{
+  for (unsigned i = 0; i < m_jump_table.length (); i++)
+    if (m_jump_table[i] == old_bb)
+      m_jump_table[i] = new_bb;
+}
+
+hsa_insn_sbr::~hsa_insn_sbr ()
+{
+  m_jump_table.release ();
+}
+
+/* Constructor of comparison instruction.  CMP is the comparison operation and T
+   is the result type.  */
+
+hsa_insn_cmp::hsa_insn_cmp (BrigCompareOperation8_t cmp, BrigType16_t t,
+			    hsa_op_base *arg0, hsa_op_base *arg1,
+			    hsa_op_base *arg2)
+  : hsa_insn_basic (3, BRIG_OPCODE_CMP, t, arg0, arg1, arg2), m_compare (cmp)
+{
+}
+
+/* New operator to allocate compare instruction from pool alloc.  */
+
+void *
+hsa_insn_cmp::operator new (size_t)
+{
+  return hsa_allocp_inst_cmp->vallocate ();
+}
+
+/* Constructor of classes representing memory accesses.  OPC is the opcode (must
+   be BRIG_OPCODE_ST or BRIG_OPCODE_LD) and T is the type.  The instruction
+   operands are provided as ARG0 and ARG1.  */
+
+hsa_insn_mem::hsa_insn_mem (int opc, BrigType16_t t, hsa_op_base *arg0,
+			    hsa_op_base *arg1)
+  : hsa_insn_basic (2, opc, t, arg0, arg1),
+  m_align (hsa_natural_alignment (t)), m_equiv_class (0)
+{
+  gcc_checking_assert (opc == BRIG_OPCODE_LD || opc == BRIG_OPCODE_ST);
+}
+
+/* Constructor for descendants allowing different opcodes and numbers of
+   operands; it passes its arguments directly to the hsa_insn_basic
+   constructor.  The instruction operands are provided as ARG[0-3].  */
+
+
+hsa_insn_mem::hsa_insn_mem (unsigned nops, int opc, BrigType16_t t,
+			    hsa_op_base *arg0, hsa_op_base *arg1,
+			    hsa_op_base *arg2, hsa_op_base *arg3)
+  : hsa_insn_basic (nops, opc, t, arg0, arg1, arg2, arg3),
+  m_align (hsa_natural_alignment (t)), m_equiv_class (0)
+{
+}
+
+/* New operator to allocate memory instruction from pool alloc.  */
+
+void *
+hsa_insn_mem::operator new (size_t)
+{
+  return hsa_allocp_inst_mem->vallocate ();
+}
+
+/* Constructor of class representing atomic instructions and signals.  OPC is
+   the principal opcode, AOP is the specific atomic operation opcode.  T is
+   the type of the instruction.  The instruction operands are provided as
+   ARG[0-3].  */
+
+hsa_insn_atomic::hsa_insn_atomic (int nops, int opc,
+				  enum BrigAtomicOperation aop,
+				  BrigType16_t t, BrigMemoryOrder memorder,
+				  hsa_op_base *arg0,
+				  hsa_op_base *arg1, hsa_op_base *arg2,
+				  hsa_op_base *arg3)
+  : hsa_insn_mem (nops, opc, t, arg0, arg1, arg2, arg3), m_atomicop (aop),
+  m_memoryorder (memorder),
+  m_memoryscope (BRIG_MEMORY_SCOPE_SYSTEM)
+{
+  gcc_checking_assert (opc == BRIG_OPCODE_ATOMICNORET ||
+		       opc == BRIG_OPCODE_ATOMIC ||
+		       opc == BRIG_OPCODE_SIGNAL ||
+		       opc == BRIG_OPCODE_SIGNALNORET);
+}
+
+/* New operator to allocate atomic instruction from pool alloc.  */
+
+void *
+hsa_insn_atomic::operator new (size_t)
+{
+  return hsa_allocp_inst_atomic->vallocate ();
+}
+
+/* Constructor of class representing signal instructions.  OPC is the
+   principal opcode, SOP is the specific signal operation opcode.  T is the
+   type of the instruction.  The instruction operands are provided as
+   ARG[0-3].  */
+
+hsa_insn_signal::hsa_insn_signal (int nops, int opc,
+				  enum BrigAtomicOperation sop,
+				  BrigType16_t t, hsa_op_base *arg0,
+				  hsa_op_base *arg1, hsa_op_base *arg2,
+				  hsa_op_base *arg3)
+  : hsa_insn_atomic (nops, opc, sop, t, BRIG_MEMORY_ORDER_SC_ACQUIRE_RELEASE,
+		     arg0, arg1, arg2, arg3)
+{
+}
+
+/* New operator to allocate signal instruction from pool alloc.  */
+
+void *
+hsa_insn_signal::operator new (size_t)
+{
+  return hsa_allocp_inst_signal->vallocate ();
+}
+
+/* Constructor of class representing segment conversion instructions.  OPC is
+   the opcode which must be either BRIG_OPCODE_STOF or BRIG_OPCODE_FTOS.  DEST
+   and SRCT are destination and source types respectively, SEG is the segment
+   we are converting to or from.  The instruction operands are
+   provided as ARG0 and ARG1.  */
+
+hsa_insn_seg::hsa_insn_seg (int opc, BrigType16_t dest, BrigType16_t srct,
+			    BrigSegment8_t seg, hsa_op_base *arg0,
+			    hsa_op_base *arg1)
+  : hsa_insn_basic (2, opc, dest, arg0, arg1), m_src_type (srct),
+  m_segment (seg)
+{
+  gcc_checking_assert (opc == BRIG_OPCODE_STOF || opc == BRIG_OPCODE_FTOS);
+}
+
+/* New operator to allocate address conversion instruction from pool alloc.  */
+
+void *
+hsa_insn_seg::operator new (size_t)
+{
+  return hsa_allocp_inst_seg->vallocate ();
+}
+
+/* Constructor of class representing a call instruction.  CALLEE is the tree
+   representation of the function being called.  */
+
+hsa_insn_call::hsa_insn_call (tree callee)
+  : hsa_insn_basic (0, BRIG_OPCODE_CALL), m_called_function (callee),
+  m_output_arg (NULL), m_args_code_list (NULL), m_result_code_list (NULL)
+{
+}
+
+/* New operator to allocate call instruction from pool alloc.  */
+
+void *
+hsa_insn_call::operator new (size_t)
+{
+  return hsa_allocp_inst_call->vallocate ();
+}
+
+hsa_insn_call::~hsa_insn_call ()
+{
+  for (unsigned i = 0; i < m_input_args.length (); i++)
+    delete m_input_args[i];
+
+  delete m_output_arg;
+
+  m_input_args.release ();
+  m_input_arg_insns.release ();
+}
+
+/* Constructor of class representing the argument block required to invoke
+   a call in HSAIL.  */
+
+hsa_insn_arg_block::hsa_insn_arg_block (BrigKind brig_kind,
+					hsa_insn_call * call)
+  : hsa_insn_basic (0, HSA_OPCODE_ARG_BLOCK), m_kind (brig_kind),
+  m_call_insn (call)
+{
+}
+
+/* New operator to allocate argument block instruction from pool alloc.  */
+
+void *
+hsa_insn_arg_block::operator new (size_t)
+{
+  return hsa_allocp_inst_arg_block->vallocate ();
+}
+
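+/* Constructor of class representing comment directives.  S is the text of
+   the comment, to which "// " is prepended.  */
+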
+hsa_insn_comment::hsa_insn_comment (const char *s)
+  : hsa_insn_basic (0, BRIG_KIND_DIRECTIVE_COMMENT)
+{
+  unsigned l = strlen (s);
+
+  /* Prepend "// " to the string.  */
+  char *buf = XNEWVEC (char, l + 4);
+  sprintf (buf, "// %s", s);
+  m_comment = buf;
+}
+
+/* New operator to allocate comment instruction from pool alloc.  */
+
+void *
+hsa_insn_comment::operator new (size_t)
+{
+  return hsa_allocp_inst_comment->vallocate ();
+}
+
+hsa_insn_comment::~hsa_insn_comment ()
+{
+  gcc_checking_assert (m_comment);
+  free (m_comment);
+  m_comment = NULL;
+}
+
+/* Constructor of class representing the queue instruction in HSAIL.  */
+
+hsa_insn_queue::hsa_insn_queue (int nops, BrigOpcode opcode)
+  : hsa_insn_basic (nops, opcode, BRIG_TYPE_U64)
+{
+}
+
+/* New operator to allocate packed instruction from pool alloc.  */
+
+void *
+hsa_insn_packed::operator new (size_t)
+{
+  return hsa_allocp_inst_packed->vallocate ();
+}
+
+/* Constructor of class representing the packed instruction in HSAIL.  */
+
+hsa_insn_packed::hsa_insn_packed (int nops, BrigOpcode opcode,
+				  BrigType16_t destt, BrigType16_t srct,
+				  hsa_op_base *arg0, hsa_op_base *arg1,
+				  hsa_op_base *arg2)
+  : hsa_insn_basic (nops, opcode, destt, arg0, arg1, arg2),
+  m_source_type (srct)
+{
+  m_operand_list = new hsa_op_operand_list (nops - 1);
+}
+
+/* New operator to allocate convert instruction from pool alloc.  */
+
+void *
+hsa_insn_cvt::operator new (size_t)
+{
+  return hsa_allocp_inst_cvt->vallocate ();
+}
+
+/* Constructor of class representing the convert instruction in HSAIL.  */
+
+hsa_insn_cvt::hsa_insn_cvt (hsa_op_with_type *dest, hsa_op_with_type *src)
+  : hsa_insn_basic (2, BRIG_OPCODE_CVT, dest->m_type, dest, src)
+{
+}
+
+/* Append an instruction INSN into the basic block.  */
+
+void
+hsa_bb::append_insn (hsa_insn_basic *insn)
+{
+  gcc_assert (insn->m_opcode != 0 || insn->operand_count () == 0);
+  gcc_assert (!insn->m_bb);
+
+  insn->m_bb = m_bb;
+  insn->m_prev = m_last_insn;
+  insn->m_next = NULL;
+  if (m_last_insn)
+    m_last_insn->m_next = insn;
+  m_last_insn = insn;
+  if (!m_first_insn)
+    m_first_insn = insn;
+}
+
+/* Insert HSA instruction NEW_INSN immediately before an existing instruction
+   OLD_INSN.  */
+
+static void
+hsa_insert_insn_before (hsa_insn_basic *new_insn, hsa_insn_basic *old_insn)
+{
+  hsa_bb *hbb = hsa_bb_for_bb (old_insn->m_bb);
+
+  if (hbb->m_first_insn == old_insn)
+    hbb->m_first_insn = new_insn;
+  new_insn->m_prev = old_insn->m_prev;
+  new_insn->m_next = old_insn;
+  if (old_insn->m_prev)
+    old_insn->m_prev->m_next = new_insn;
+  old_insn->m_prev = new_insn;
+}
+
+/* Append HSA instruction NEW_INSN immediately after an existing instruction
+   OLD_INSN.  */
+
+static void
+hsa_append_insn_after (hsa_insn_basic *new_insn, hsa_insn_basic *old_insn)
+{
+  hsa_bb *hbb = hsa_bb_for_bb (old_insn->m_bb);
+
+  if (hbb->m_last_insn == old_insn)
+    hbb->m_last_insn = new_insn;
+  new_insn->m_prev = old_insn;
+  new_insn->m_next = old_insn->m_next;
+  if (old_insn->m_next)
+    old_insn->m_next->m_prev = new_insn;
+  old_insn->m_next = new_insn;
+}
+
+/* Return a register containing the calculated value of EXP which must be an
+   expression consisting of PLUS_EXPRs, MULT_EXPRs, NOP_EXPRs, SSA_NAMEs and
+   integer constants as returned by get_inner_reference.
+   Newly generated HSA instructions will be appended to HBB.
+   Perform all calculations in ADDRTYPE.  */
+
+static hsa_op_with_type *
+gen_address_calculation (tree exp, hsa_bb *hbb, BrigType16_t addrtype)
+{
+  int opcode;
+
+  if (TREE_CODE (exp) == NOP_EXPR)
+    exp = TREE_OPERAND (exp, 0);
+
+  switch (TREE_CODE (exp))
+    {
+    case SSA_NAME:
+      return hsa_cfun->reg_for_gimple_ssa (exp)->get_in_type (addrtype, hbb);
+
+    case INTEGER_CST:
+      {
+       hsa_op_immed *imm = new hsa_op_immed (exp);
+       if (addrtype != imm->m_type)
+	 imm->m_type = addrtype;
+       return imm;
+      }
+
+    case PLUS_EXPR:
+      opcode = BRIG_OPCODE_ADD;
+      break;
+
+    case MULT_EXPR:
+      opcode = BRIG_OPCODE_MUL;
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  hsa_op_reg *res = new hsa_op_reg (addrtype);
+  hsa_insn_basic *insn = new hsa_insn_basic (3, opcode, addrtype);
+  insn->set_op (0, res);
+
+  hsa_op_with_type *op1 = gen_address_calculation (TREE_OPERAND (exp, 0), hbb,
+						   addrtype);
+  hsa_op_with_type *op2 = gen_address_calculation (TREE_OPERAND (exp, 1), hbb,
+						   addrtype);
+  insn->set_op (1, op1);
+  insn->set_op (2, op2);
+
+  hbb->append_insn (insn);
+  return res;
+}
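+
+/* For instance, for a variable offset of the form i_1 * 8 + 4 coming from
+   get_inner_reference, the function recursively emits a mul and an add
+   performed in ADDRTYPE and returns the register holding the result.  */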
+
+/* If R1 is NULL, just return R2, otherwise append an instruction adding them
+   to HBB and return the register holding the result.  */
+
+static hsa_op_reg *
+add_addr_regs_if_needed (hsa_op_reg *r1, hsa_op_reg *r2, hsa_bb *hbb)
+{
+  gcc_checking_assert (r2);
+  if (!r1)
+    return r2;
+
+  hsa_op_reg *res = new hsa_op_reg (r1->m_type);
+  gcc_assert (!hsa_needs_cvt (r1->m_type, r2->m_type));
+  hsa_insn_basic *insn = new hsa_insn_basic (3, BRIG_OPCODE_ADD, res->m_type);
+  insn->set_op (0, res);
+  insn->set_op (1, r1);
+  insn->set_op (2, r2);
+  hbb->append_insn (insn);
+  return res;
+}
+
+/* Helper of gen_hsa_addr.  Update *SYMBOL, *ADDRTYPE, *REG and *OFFSET to
+   reflect BASE which is the first operand of a MEM_REF or a TARGET_MEM_REF.  */
+
+static void
+process_mem_base (tree base, hsa_symbol **symbol, BrigType16_t *addrtype,
+		  hsa_op_reg **reg, offset_int *offset, hsa_bb *hbb)
+{
+  if (TREE_CODE (base) == SSA_NAME)
+    {
+      gcc_assert (!*reg);
+      hsa_op_with_type *ssa = hsa_cfun->reg_for_gimple_ssa (base)->get_in_type
+	(*addrtype, hbb);
+      *reg = dyn_cast <hsa_op_reg *> (ssa);
+    }
+  else if (TREE_CODE (base) == ADDR_EXPR)
+    {
+      tree decl = TREE_OPERAND (base, 0);
+
+      if (!DECL_P (decl) || TREE_CODE (decl) == FUNCTION_DECL)
+	{
+	  HSA_SORRY_AT (EXPR_LOCATION (base),
+			"support for HSA does not implement a memory reference "
+			"to a non-declaration type");
+	  return;
+	}
+
+      gcc_assert (!*symbol);
+
+      *symbol = get_symbol_for_decl (decl);
+      *addrtype = hsa_get_segment_addr_type ((*symbol)->m_segment);
+    }
+  else if (TREE_CODE (base) == INTEGER_CST)
+    *offset += wi::to_offset (base);
+  else
+    gcc_unreachable ();
+}
+
+/* Forward declaration of a function.  */
+
+static void
+gen_hsa_addr_insns (tree val, hsa_op_reg *dest, hsa_bb *hbb);
+
+/* Generate HSA address operand for a given tree memory reference REF.  If
+   instructions need to be created to calculate the address, they will be added
+   to the end of HBB.  If a caller provides OUTPUT_BITSIZE and OUTPUT_BITPOS,
+   the function assumes that the caller will handle possible
+   bit-field references.  Otherwise, if we reference a bit-field, a sorry
+   message is displayed.  */
+
+static hsa_op_address *
+gen_hsa_addr (tree ref, hsa_bb *hbb, HOST_WIDE_INT *output_bitsize = NULL,
+	      HOST_WIDE_INT *output_bitpos = NULL)
+{
+  hsa_symbol *symbol = NULL;
+  hsa_op_reg *reg = NULL;
+  offset_int offset = 0;
+  tree origref = ref;
+  tree varoffset = NULL_TREE;
+  BrigType16_t addrtype = hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT);
+  HOST_WIDE_INT bitsize = 0, bitpos = 0;
+  BrigType16_t flat_addrtype = hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT);
+
+  if (TREE_CODE (ref) == STRING_CST)
+    {
+      symbol = hsa_get_string_cst_symbol (ref);
+      goto out;
+    }
+  else if (TREE_CODE (ref) == BIT_FIELD_REF
+	   && ((tree_to_uhwi (TREE_OPERAND (ref, 1)) % BITS_PER_UNIT) != 0
+	       || (tree_to_uhwi (TREE_OPERAND (ref, 2)) % BITS_PER_UNIT) != 0))
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (origref),
+		     "support for HSA does not implement "
+		     "bit field references such as %E", ref);
+      goto out;
+    }
+
+  if (handled_component_p (ref))
+    {
+      enum machine_mode mode;
+      int unsignedp, volatilep;
+
+      ref = get_inner_reference (ref, &bitsize, &bitpos, &varoffset, &mode,
+				 &unsignedp, &volatilep, false);
+
+      offset = bitpos;
+      offset = wi::rshift (offset, LOG2_BITS_PER_UNIT, SIGNED);
+    }
+
+  switch (TREE_CODE (ref))
+    {
+    case ADDR_EXPR:
+      {
+	addrtype = hsa_get_segment_addr_type (BRIG_SEGMENT_PRIVATE);
+	symbol = hsa_cfun->create_hsa_temporary (flat_addrtype);
+	hsa_op_reg *r = new hsa_op_reg (flat_addrtype);
+	gen_hsa_addr_insns (ref, r, hbb);
+	hbb->append_insn (new hsa_insn_mem (BRIG_OPCODE_ST, r->m_type,
+					    r, new hsa_op_address (symbol)));
+
+	break;
+      }
+    case SSA_NAME:
+      {
+	addrtype = hsa_get_segment_addr_type (BRIG_SEGMENT_PRIVATE);
+	symbol = hsa_cfun->create_hsa_temporary (flat_addrtype);
+	hsa_op_reg *r = hsa_cfun->reg_for_gimple_ssa (ref);
+
+	hbb->append_insn (new hsa_insn_mem (BRIG_OPCODE_ST, r->m_type,
+					    r, new hsa_op_address (symbol)));
+
+	break;
+      }
+    case PARM_DECL:
+    case VAR_DECL:
+    case RESULT_DECL:
+      gcc_assert (!symbol);
+      symbol = get_symbol_for_decl (ref);
+      addrtype = hsa_get_segment_addr_type (symbol->m_segment);
+      break;
+
+    case MEM_REF:
+      process_mem_base (TREE_OPERAND (ref, 0), &symbol, &addrtype, &reg,
+			&offset, hbb);
+
+      if (!integer_zerop (TREE_OPERAND (ref, 1)))
+	offset += wi::to_offset (TREE_OPERAND (ref, 1));
+      break;
+
+    case TARGET_MEM_REF:
+      process_mem_base (TMR_BASE (ref), &symbol, &addrtype, &reg, &offset, hbb);
+      if (TMR_INDEX (ref))
+	{
+	  hsa_op_reg *disp1;
+	  hsa_op_base *idx = hsa_cfun->reg_for_gimple_ssa
+	    (TMR_INDEX (ref))->get_in_type (addrtype, hbb);
+	  if (TMR_STEP (ref) && !integer_onep (TMR_STEP (ref)))
+	    {
+	      disp1 = new hsa_op_reg (addrtype);
+	      hsa_insn_basic *insn = new hsa_insn_basic (3, BRIG_OPCODE_MUL,
+							 addrtype);
+
+	      /* The step must respect ADDRTYPE, so we overwrite the type
+		 of the immediate value.  */
+	      hsa_op_immed *step = new hsa_op_immed (TMR_STEP (ref));
+	      step->m_type = addrtype;
+
+	      insn->set_op (0, disp1);
+	      insn->set_op (1, idx);
+	      insn->set_op (2, step);
+	      hbb->append_insn (insn);
+	    }
+	  else
+	    disp1 = as_a <hsa_op_reg *> (idx);
+	  reg = add_addr_regs_if_needed (reg, disp1, hbb);
+	}
+      if (TMR_INDEX2 (ref))
+	{
+	  hsa_op_base *disp2 = hsa_cfun->reg_for_gimple_ssa
+	    (TMR_INDEX2 (ref))->get_in_type (addrtype, hbb);
+	  reg = add_addr_regs_if_needed (reg, as_a <hsa_op_reg *> (disp2), hbb);
+	}
+      offset += wi::to_offset (TMR_OFFSET (ref));
+      break;
+    case FUNCTION_DECL:
+      HSA_SORRY_AT (EXPR_LOCATION (origref),
+		    "support for HSA does not implement function pointers");
+      goto out;
+    default:
+      HSA_SORRY_ATV (EXPR_LOCATION (origref), "support for HSA does "
+		     "not implement memory access to %E", origref);
+      goto out;
+    }
+
+  if (varoffset)
+    {
+      if (TREE_CODE (varoffset) == INTEGER_CST)
+	offset += wi::to_offset (varoffset);
+      else
+	{
+	  hsa_op_base *off_op = gen_address_calculation (varoffset, hbb,
+							 addrtype);
+	  reg = add_addr_regs_if_needed (reg, as_a <hsa_op_reg *> (off_op),
+					 hbb);
+	}
+    }
+
+  gcc_checking_assert ((symbol
+			&& addrtype
+			== hsa_get_segment_addr_type (symbol->m_segment))
+		       || (!symbol
+			   && addrtype
+			   == hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT)));
+out:
+  HOST_WIDE_INT hwi_offset = offset.to_shwi ();
+
+  /* Calculate the remaining bit offset (if present).  */
+  bitpos %= BITS_PER_UNIT;
+  /* If bitsize is a power of two that is greater than or equal to
+     BITS_PER_UNIT, that alone is no reason to treat this as a bit-field
+     access.  */
+  if (bitpos == 0
+      && (bitsize >= BITS_PER_UNIT)
+      && !(bitsize & (bitsize - 1)))
+    bitsize = 0;
+
+  if ((bitpos || bitsize) && (output_bitpos == NULL || output_bitsize == NULL))
+    HSA_SORRY_ATV (EXPR_LOCATION (origref), "support for HSA does not "
+		   "implement unhandled bit field reference such as %E", ref);
+
+  if (output_bitsize != NULL && output_bitpos != NULL)
+    {
+      *output_bitsize = bitsize;
+      *output_bitpos = bitpos;
+    }
+
+  return new hsa_op_address (symbol, reg, hwi_offset);
+}
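+
+/* For example, for a reference like s.f, where s is a local variable, the
+   returned operand has M_SYMBOL pointing to the private-segment symbol of s
+   and the byte offset of field f folded into M_IMM_OFFSET.  */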
+
+/* Generate an HSA address for a function call argument of given TYPE.
+   INDEX is used to generate the corresponding name of the argument.
+   The special value -1 denotes the function result.  */
+
+static hsa_op_address *
+gen_hsa_addr_for_arg (tree tree_type, int index)
+{
+  hsa_symbol *sym = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_ARG,
+				    BRIG_LINKAGE_ARG);
+  sym->m_type = hsa_type_for_tree_type (tree_type, &sym->m_dim);
+
+  if (index == -1) /* Function result.  */
+    sym->m_name = "res";
+  else /* Function call arguments.  */
+    {
+      sym->m_name = NULL;
+      sym->m_name_number = index;
+    }
+
+  return new hsa_op_address (sym);
+}
+
+/* Generate HSA instructions that calculate address of VAL including all
+   necessary conversions to flat addressing and place the result into DEST.
+   Instructions are appended to HBB.  */
+
+static void
+gen_hsa_addr_insns (tree val, hsa_op_reg *dest, hsa_bb *hbb)
+{
+  /* Handle cases like tmp = NULL, where we just emit a move instruction
+     to a register.  */
+  if (TREE_CODE (val) == INTEGER_CST)
+    {
+      hsa_op_immed *c = new hsa_op_immed (val);
+      hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_MOV,
+						 dest->m_type, dest, c);
+      hbb->append_insn (insn);
+      return;
+    }
+
+  hsa_op_address *addr;
+
+  gcc_assert (dest->m_type == hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT));
+  if (TREE_CODE (val) == ADDR_EXPR)
+    val = TREE_OPERAND (val, 0);
+  addr = gen_hsa_addr (val, hbb);
+  hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_LDA);
+  insn->set_op (1, addr);
+  if (addr->m_symbol && addr->m_symbol->m_segment != BRIG_SEGMENT_GLOBAL)
+    {
+      /* LDA produces a segment-relative address; we need to convert
+	 it to a flat one.  */
+      hsa_op_reg *tmp;
+      tmp = new hsa_op_reg (hsa_get_segment_addr_type
+			    (addr->m_symbol->m_segment));
+      hsa_insn_seg *seg;
+      seg = new hsa_insn_seg (BRIG_OPCODE_STOF,
+			      hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT),
+			      tmp->m_type, addr->m_symbol->m_segment, dest,
+			      tmp);
+
+      insn->set_op (0, tmp);
+      insn->m_type = tmp->m_type;
+      hbb->append_insn (insn);
+      hbb->append_insn (seg);
+    }
+  else
+    {
+      insn->set_op (0, dest);
+      insn->m_type = hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT);
+      hbb->append_insn (insn);
+    }
+}
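+
+/* For a private-segment symbol the sequence above therefore corresponds
+   roughly to the following HSAIL text (a sketch; exact mnemonics depend on
+   the machine model):
+
+     lda_private_u32 $s0, [%sym];
+     stof_private_u64_u32 $d0, $s0;
+
+   whereas for a global symbol the lda alone already produces a flat
+   address.  */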
+
+/* Return an HSA register or HSA immediate value operand corresponding to
+   gimple operand OP.  */
+
+static hsa_op_with_type *
+hsa_reg_or_immed_for_gimple_op (tree op, hsa_bb *hbb)
+{
+  hsa_op_reg *tmp;
+
+  if (TREE_CODE (op) == SSA_NAME)
+    tmp = hsa_cfun->reg_for_gimple_ssa (op);
+  else if (!POINTER_TYPE_P (TREE_TYPE (op)))
+    return new hsa_op_immed (op);
+  else
+    {
+      tmp = new hsa_op_reg (hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT));
+      gen_hsa_addr_insns (op, tmp, hbb);
+    }
+  return tmp;
+}
+
+/* Create a simple movement instruction with register destination DEST and
+   register or immediate source SRC and append it to the end of HBB.  */
+
+void
+hsa_build_append_simple_mov (hsa_op_reg *dest, hsa_op_base *src, hsa_bb *hbb)
+{
+  hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_MOV, dest->m_type,
+					     dest, src);
+  if (hsa_op_reg *sreg = dyn_cast <hsa_op_reg *> (src))
+    gcc_assert (hsa_type_bit_size (dest->m_type)
+		== hsa_type_bit_size (sreg->m_type));
+  else
+    gcc_assert (hsa_type_bit_size (dest->m_type)
+		== hsa_type_bit_size (as_a <hsa_op_immed *> (src)->m_type));
+
+  hbb->append_insn (insn);
+}
+
+/* Generate HSAIL instructions loading a bit field into register DEST.
+   VALUE_REG is a register of an SSA name that is used in the bit field
+   reference.  BITPOS is the offset into the loaded value and BITSIZE is the
+   number of bits of the bit field.
+   Add instructions to HBB.  */
+
+static void
+gen_hsa_insns_for_bitfield (hsa_op_reg *dest, hsa_op_reg *value_reg,
+			    HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
+			    hsa_bb *hbb)
+{
+  unsigned type_bitsize = hsa_type_bit_size (dest->m_type);
+  unsigned left_shift = type_bitsize - (bitsize + bitpos);
+  unsigned right_shift = left_shift + bitpos;
+
+  if (left_shift)
+    {
+      hsa_op_reg *value_reg_2 = new hsa_op_reg (dest->m_type);
+      hsa_op_immed *c = new hsa_op_immed (left_shift, BRIG_TYPE_U32);
+
+      hsa_insn_basic *lshift = new hsa_insn_basic
+	(3, BRIG_OPCODE_SHL, value_reg_2->m_type, value_reg_2, value_reg, c);
+
+      hbb->append_insn (lshift);
+
+      value_reg = value_reg_2;
+    }
+
+  if (right_shift)
+    {
+      hsa_op_reg *value_reg_2 = new hsa_op_reg (dest->m_type);
+      hsa_op_immed *c = new hsa_op_immed (right_shift, BRIG_TYPE_U32);
+
+      hsa_insn_basic *rshift = new hsa_insn_basic
+	(3, BRIG_OPCODE_SHR, value_reg_2->m_type, value_reg_2, value_reg, c);
+
+      hbb->append_insn (rshift);
+
+      value_reg = value_reg_2;
+    }
+
+  hsa_insn_basic *assignment = new hsa_insn_basic
+    (2, BRIG_OPCODE_MOV, dest->m_type, dest, value_reg);
+  hbb->append_insn (assignment);
+}
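+
+/* To illustrate the shifts above: extracting a bit field with BITSIZE 5 at
+   BITPOS 3 from a 32-bit register uses LEFT_SHIFT = 32 - (5 + 3) = 24 and
+   RIGHT_SHIFT = 24 + 3 = 27; the field is first moved to the most
+   significant bits and then shifted back down, extending it to the full
+   register width.  */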
+
+
+/* Generate HSAIL instructions loading a bit field into register DEST.  ADDR
+   is the prepared memory address that is used to load the bit field.  BITPOS
+   is the offset into the loaded memory and BITSIZE is the number of bits of
+   the bit field.  Add instructions to HBB.  The load must be performed with
+   alignment ALIGN.  */
+
+static void
+gen_hsa_insns_for_bitfield_load (hsa_op_reg *dest, hsa_op_address *addr,
+				 HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
+				 hsa_bb *hbb, BrigAlignment8_t align)
+{
+  hsa_op_reg *value_reg = new hsa_op_reg (dest->m_type);
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, dest->m_type, value_reg,
+					addr);
+  mem->set_align (align);
+  hbb->append_insn (mem);
+  gen_hsa_insns_for_bitfield (dest, value_reg, bitsize, bitpos, hbb);
+}
+
+/* Return the alignment of base memory accesses we issue to perform bit-field
+   memory access REF.  */
+
+static BrigAlignment8_t
+hsa_bitmemref_alignment (tree ref)
+{
+  unsigned HOST_WIDE_INT bit_offset = 0;
+
+  while (true)
+    {
+      if (TREE_CODE (ref) == BIT_FIELD_REF)
+	{
+	  if (!tree_fits_uhwi_p (TREE_OPERAND (ref, 2)))
+	    return BRIG_ALIGNMENT_1;
+	  bit_offset += tree_to_uhwi (TREE_OPERAND (ref, 2));
+	}
+      else if (TREE_CODE (ref) == COMPONENT_REF
+	       && DECL_BIT_FIELD (TREE_OPERAND (ref, 1)))
+	bit_offset += int_bit_position (TREE_OPERAND (ref, 1));
+      else
+	break;
+      ref = TREE_OPERAND (ref, 0);
+    }
+
+  unsigned HOST_WIDE_INT bits = bit_offset % BITS_PER_UNIT;
+  unsigned HOST_WIDE_INT byte_bits = bit_offset - bits;
+  BrigAlignment8_t base = hsa_alignment_encoding (get_object_alignment (ref));
+  if (byte_bits == 0)
+    return base;
+  return MIN (base, hsa_alignment_encoding (byte_bits & -byte_bits));
+}
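+
+/* For example, a bit field at bit offset 40 (i.e. byte 5) within an object
+   aligned to 8 bytes has BYTE_BITS equal to 40, whose lowest set bit is 8,
+   so the base accesses are limited to 1-byte alignment.  */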
+
+/* Generate HSAIL instructions loading something into register DEST.  RHS is
+   the tree representation of the loaded data, which is loaded as type TYPE.
+   Add instructions to HBB.  */
+
+static void
+gen_hsa_insns_for_load (hsa_op_reg *dest, tree rhs, tree type, hsa_bb *hbb)
+{
+  /* The destination SSA name will give us the type.  */
+  if (TREE_CODE (rhs) == VIEW_CONVERT_EXPR)
+    rhs = TREE_OPERAND (rhs, 0);
+
+  if (TREE_CODE (rhs) == SSA_NAME)
+    {
+      hsa_op_reg *src = hsa_cfun->reg_for_gimple_ssa (rhs);
+      hsa_build_append_simple_mov (dest, src, hbb);
+    }
+  else if (is_gimple_min_invariant (rhs)
+	   || TREE_CODE (rhs) == ADDR_EXPR)
+    {
+      if (POINTER_TYPE_P (TREE_TYPE (rhs)))
+	{
+	  if (dest->m_type != hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT))
+	    {
+	      HSA_SORRY_ATV (EXPR_LOCATION (rhs),
+			     "support for HSA does not implement conversion "
+			     "of %E to the requested non-pointer type", rhs);
+	      return;
+	    }
+
+	  gen_hsa_addr_insns (rhs, dest, hbb);
+	}
+      else if (TREE_CODE (rhs) == COMPLEX_CST)
+	{
+	  hsa_op_immed *real_part = new hsa_op_immed (TREE_REALPART (rhs));
+	  hsa_op_immed *imag_part = new hsa_op_immed (TREE_IMAGPART (rhs));
+
+	  hsa_op_reg *real_part_reg = new hsa_op_reg
+	    (hsa_type_for_scalar_tree_type (TREE_TYPE (type), true));
+	  hsa_op_reg *imag_part_reg = new hsa_op_reg
+	    (hsa_type_for_scalar_tree_type (TREE_TYPE (type), true));
+
+	  hsa_build_append_simple_mov (real_part_reg, real_part, hbb);
+	  hsa_build_append_simple_mov (imag_part_reg, imag_part, hbb);
+
+	  BrigType16_t src_type = hsa_bittype_for_type (real_part_reg->m_type);
+
+	  hsa_insn_packed *insn = new hsa_insn_packed
+	    (3, BRIG_OPCODE_COMBINE, dest->m_type, src_type, dest,
+	     real_part_reg, imag_part_reg);
+	  hbb->append_insn (insn);
+	}
+      else
+	{
+	  hsa_op_immed *imm = new hsa_op_immed (rhs);
+	  hsa_build_append_simple_mov (dest, imm, hbb);
+	}
+    }
+  else if (TREE_CODE (rhs) == REALPART_EXPR || TREE_CODE (rhs) == IMAGPART_EXPR)
+    {
+      tree pack_type = TREE_TYPE (TREE_OPERAND (rhs, 0));
+
+      hsa_op_reg *packed_reg = new hsa_op_reg
+	(hsa_type_for_scalar_tree_type (pack_type, true));
+
+      tree complex_rhs = TREE_OPERAND (rhs, 0);
+      gen_hsa_insns_for_load (packed_reg, complex_rhs, TREE_TYPE (complex_rhs),
+			      hbb);
+
+      hsa_op_reg *real_reg = new hsa_op_reg
+	(hsa_type_for_scalar_tree_type (type, true));
+
+      hsa_op_reg *imag_reg = new hsa_op_reg
+	(hsa_type_for_scalar_tree_type (type, true));
+
+      BrigKind16_t brig_type = packed_reg->m_type;
+      hsa_insn_packed *packed = new hsa_insn_packed
+	(3, BRIG_OPCODE_EXPAND, hsa_bittype_for_type (real_reg->m_type),
+	 brig_type, real_reg, imag_reg, packed_reg);
+
+      hbb->append_insn (packed);
+
+      hsa_op_reg *source = TREE_CODE (rhs) == REALPART_EXPR ?
+	real_reg : imag_reg;
+
+      hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_MOV,
+						 dest->m_type, dest, source);
+
+      hbb->append_insn (insn);
+    }
+  else if (TREE_CODE (rhs) == BIT_FIELD_REF
+	   && TREE_CODE (TREE_OPERAND (rhs, 0)) == SSA_NAME)
+    {
+      tree ssa_name = TREE_OPERAND (rhs, 0);
+      HOST_WIDE_INT bitsize = tree_to_uhwi (TREE_OPERAND (rhs, 1));
+      HOST_WIDE_INT bitpos = tree_to_uhwi (TREE_OPERAND (rhs, 2));
+
+      hsa_op_reg *imm_value = hsa_cfun->reg_for_gimple_ssa (ssa_name);
+      gen_hsa_insns_for_bitfield (dest, imm_value, bitsize, bitpos, hbb);
+    }
+  else if (DECL_P (rhs) || TREE_CODE (rhs) == MEM_REF
+	   || TREE_CODE (rhs) == TARGET_MEM_REF
+	   || handled_component_p (rhs))
+    {
+      HOST_WIDE_INT bitsize, bitpos;
+
+      /* Load from memory.  */
+      hsa_op_address *addr;
+      addr = gen_hsa_addr (rhs, hbb, &bitsize, &bitpos);
+
+      /* Handle load of a bit field.  */
+      if (bitsize > 64)
+	{
+	  HSA_SORRY_AT (EXPR_LOCATION (rhs),
+			"support for HSA does not implement load from a bit "
+			"field bigger than 64 bits");
+	  return;
+	}
+
+      if (bitsize || bitpos)
+	gen_hsa_insns_for_bitfield_load (dest, addr, bitsize, bitpos, hbb,
+					 hsa_bitmemref_alignment (rhs));
+      else
+	{
+	  BrigType16_t mtype;
+	  /* Not dest->m_type, that's possibly extended.  */
+	  mtype = mem_type_for_type (hsa_type_for_scalar_tree_type (type,
+								    false));
+	  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, mtype, dest,
+						addr);
+	  mem->set_align (hsa_alignment_encoding (get_object_alignment (rhs)));
+	  hbb->append_insn (mem);
+	}
+    }
+  else
+    HSA_SORRY_ATV
+      (EXPR_LOCATION (rhs),
+       "support for HSA does not implement loading of expression %E", rhs);
+}
+
+/* Return the number of bits necessary to represent a bit field starting at
+   bit BITPOS and having size BITSIZE.  */
+
+static unsigned
+get_bitfield_size (unsigned bitpos, unsigned bitsize)
+{
+  unsigned s = bitpos + bitsize;
+  unsigned sizes[] = {8, 16, 32, 64};
+
+  for (unsigned i = 0; i < 4; i++)
+    if (s <= sizes[i])
+      return sizes[i];
+
+  gcc_unreachable ();
+  return 0;
+}
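+
+/* For example, a bit field at BITPOS 3 with BITSIZE 7 occupies bits 3 to 9
+   and thus needs a 16-bit access, whereas one at BITPOS 0 with BITSIZE 8
+   fits into 8 bits.  */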
+
+/* Generate HSAIL instructions storing into memory.  LHS is the destination of
+   the store, SRC is the source operand.  Add instructions to HBB.  */
+
+static void
+gen_hsa_insns_for_store (tree lhs, hsa_op_base *src, hsa_bb *hbb)
+{
+  HOST_WIDE_INT bitsize = 0, bitpos = 0;
+  BrigAlignment8_t req_align;
+  BrigType16_t mtype;
+  mtype = mem_type_for_type (hsa_type_for_scalar_tree_type (TREE_TYPE (lhs),
+							    false));
+  hsa_op_address *addr;
+  addr = gen_hsa_addr (lhs, hbb, &bitsize, &bitpos);
+
+  /* Handle store to a bit field.  */
+  if (bitsize > 64)
+    {
+      HSA_SORRY_AT (EXPR_LOCATION (lhs),
+		    "support for HSA does not implement store to a bit field "
+		    "bigger than 64 bits");
+      return;
+    }
+
+  unsigned type_bitsize = get_bitfield_size (bitpos, bitsize);
+
+  /* HSAIL does not support a MOV insn with 16-bit integers.  */
+  if (type_bitsize < 32)
+    type_bitsize = 32;
+
+  if (bitpos || (bitsize && type_bitsize != bitsize))
+    {
+      unsigned HOST_WIDE_INT mask = 0;
+      BrigType16_t mem_type = get_integer_type_by_bytes
+	(type_bitsize / BITS_PER_UNIT, !TYPE_UNSIGNED (TREE_TYPE (lhs)));
+
+      for (unsigned i = 0; i < type_bitsize; i++)
+	if (i < bitpos || i >= bitpos + bitsize)
+	  mask |= ((unsigned HOST_WIDE_INT)1 << i);
+
+      hsa_op_reg *value_reg = new hsa_op_reg (mem_type);
+
+      req_align = hsa_bitmemref_alignment (lhs);
+      /* Load value from memory.  */
+      hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, mem_type,
+					    value_reg, addr);
+      mem->set_align (req_align);
+      hbb->append_insn (mem);
+
+      /* AND the loaded value with the prepared mask.  */
+      hsa_op_reg *cleared_reg = new hsa_op_reg (mem_type);
+
+      hsa_op_immed *c = new hsa_op_immed
+	(mask, get_integer_type_by_bytes (type_bitsize / BITS_PER_UNIT, false));
+
+      hsa_insn_basic *clearing = new hsa_insn_basic
+	(3, BRIG_OPCODE_AND, mem_type, cleared_reg, value_reg, c);
+      hbb->append_insn (clearing);
+
+      /* Shift the value that is going to be stored to the left.  */
+      hsa_op_reg *new_value_reg = new hsa_op_reg (mem_type);
+
+      hsa_insn_basic *basic = new hsa_insn_basic (2, BRIG_OPCODE_MOV, mem_type,
+						  new_value_reg, src);
+      hbb->append_insn (basic);
+
+      if (bitpos)
+	{
+	  hsa_op_reg *shifted_value_reg = new hsa_op_reg (mem_type);
+	  c = new hsa_op_immed (bitpos, BRIG_TYPE_U32);
+
+	  hsa_insn_basic *basic = new hsa_insn_basic
+	    (3, BRIG_OPCODE_SHL, mem_type, shifted_value_reg, new_value_reg, c);
+	  hbb->append_insn (basic);
+
+	  new_value_reg = shifted_value_reg;
+	}
+
+      /* OR the prepared value with the masked chunk loaded from memory.  */
+      hsa_op_reg *prepared_reg = new hsa_op_reg (mem_type);
+      basic = new hsa_insn_basic (3, BRIG_OPCODE_OR, mem_type, prepared_reg,
+				  new_value_reg, cleared_reg);
+      hbb->append_insn (basic);
+
+      src = prepared_reg;
+      mtype = mem_type;
+    }
+  else
+    req_align = hsa_alignment_encoding (get_object_alignment (lhs));
+
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_ST, mtype, src, addr);
+  mem->set_align (req_align);
+
+  /* XXX The HSAIL verifier has another constraint: if the source
+     is an immediate then it must match the destination type.  If
+     it's a register the low bits will be used for sub-word stores.
+     We're always allocating new operands so we can modify the above
+     in place.  */
+  if (hsa_op_immed *imm = dyn_cast <hsa_op_immed *> (src))
+    {
+      if ((imm->m_type & BRIG_TYPE_PACK_MASK) == BRIG_TYPE_PACK_NONE)
+	imm->m_type = mem->m_type;
+      else
+	{
+	  /* ...and all vector immediates apparently need to be vectors of
+	     unsigned bytes. */
+	  unsigned bs = hsa_type_bit_size (imm->m_type);
+	  gcc_assert (bs == hsa_type_bit_size (mem->m_type));
+	  switch (bs)
+	    {
+	    case 32:
+	      imm->m_type = BRIG_TYPE_U8X4;
+	      break;
+	    case 64:
+	      imm->m_type = BRIG_TYPE_U8X8;
+	      break;
+	    case 128:
+	      imm->m_type = BRIG_TYPE_U8X16;
+	      break;
+	    default:
+	      gcc_unreachable ();
+	    }
+	}
+    }
+
+  hbb->append_insn (mem);
+}
+
+/* Generate memory copy instructions that copy SIZE bytes from the memory
+   location described by SRC to the one described by TARGET, both given as
+   address operands.  Instructions are appended to HBB.  */
+
+static void
+gen_hsa_memory_copy (hsa_bb *hbb, hsa_op_address *target, hsa_op_address *src,
+		     unsigned size)
+{
+  hsa_op_address *addr;
+  hsa_insn_mem *mem;
+
+  unsigned offset = 0;
+
+  while (size)
+    {
+      unsigned s;
+      if (size >= 8)
+	s = 8;
+      else if (size >= 4)
+	s = 4;
+      else if (size >= 2)
+	s = 2;
+      else
+	s = 1;
+
+      BrigType16_t t = get_integer_type_by_bytes (s, false);
+
+      hsa_op_reg *tmp = new hsa_op_reg (t);
+      addr = new hsa_op_address (src->m_symbol, src->m_reg,
+				 src->m_imm_offset + offset);
+      mem = new hsa_insn_mem (BRIG_OPCODE_LD, t, tmp, addr);
+      hbb->append_insn (mem);
+
+      addr = new hsa_op_address (target->m_symbol, target->m_reg,
+				 target->m_imm_offset + offset);
+      mem = new hsa_insn_mem (BRIG_OPCODE_ST, t, tmp, addr);
+      hbb->append_insn (mem);
+      offset += s;
+      size -= s;
+    }
+}
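+
+/* The copy is greedily decomposed into the widest possible power-of-two
+   accesses, so e.g. a 13-byte copy is emitted as an 8-byte, a 4-byte and a
+   1-byte load/store pair.  */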
+
+/* Create a memset value by replicating the byte CONSTANT into an integer
+   of BYTE_SIZE bytes.  */
+
+static unsigned HOST_WIDE_INT
+build_memset_value (unsigned HOST_WIDE_INT constant, unsigned byte_size)
+{
+  HOST_WIDE_INT v = constant;
+
+  for (unsigned i = 1; i < byte_size; i++)
+    v |= constant << (8 * i);
+
+  return v;
+}
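+
+/* For example, build_memset_value (0xab, 4) yields 0xabababab.  */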
+
+/* Generate memory set instructions that store the byte value CONSTANT
+   into TARGET memory of SIZE bytes.  Instructions are appended to HBB.  */
+
+static void
+gen_hsa_memory_set (hsa_bb *hbb, hsa_op_address *target,
+		    unsigned HOST_WIDE_INT constant,
+		    unsigned size)
+{
+  hsa_op_address *addr;
+  hsa_insn_mem *mem;
+
+  unsigned offset = 0;
+
+  while (size)
+    {
+      unsigned s;
+      if (size >= 8)
+	s = 8;
+      else if (size >= 4)
+	s = 4;
+      else if (size >= 2)
+	s = 2;
+      else
+	s = 1;
+
+      addr = new hsa_op_address (target->m_symbol, target->m_reg,
+				 target->m_imm_offset + offset);
+
+      BrigType16_t t = get_integer_type_by_bytes (s, false);
+      HOST_WIDE_INT c = build_memset_value (constant, s);
+
+      mem = new hsa_insn_mem (BRIG_OPCODE_ST, t, new hsa_op_immed (c, t),
+			      addr);
+      hbb->append_insn (mem);
+      offset += s;
+      size -= s;
+    }
+}
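+
+/* Like the memory copy above, the set is decomposed into the widest
+   power-of-two stores, so e.g. setting 6 bytes to 0xff emits a 4-byte store
+   of 0xffffffff followed by a 2-byte store of 0xffff.  */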
+
+/* Generate HSAIL instructions for a single assignment of an empty constructor
+   to memory described by ADDR_LHS.  The constructor is passed as tree RHS and
+   all instructions are appended to HBB.  */
+
+void
+gen_hsa_ctor_assignment (hsa_op_address *addr_lhs, tree rhs, hsa_bb *hbb)
+{
+  if (vec_safe_length (CONSTRUCTOR_ELTS (rhs)))
+    {
+      HSA_SORRY_AT (EXPR_LOCATION (rhs),
+		    "support for HSA does not implement load from constructor");
+      return;
+    }
+
+  unsigned size = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (rhs)));
+  gen_hsa_memory_set (hbb, addr_lhs, 0, size);
+}
+
+/* Generate HSA instructions for a single assignment of RHS to LHS.
+   HBB is the basic block they will be appended to.  */
+
+static void
+gen_hsa_insns_for_single_assignment (tree lhs, tree rhs, hsa_bb *hbb)
+{
+  if (TREE_CODE (lhs) == SSA_NAME)
+    {
+      hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+      if (hsa_seen_error ())
+	return;
+
+      gen_hsa_insns_for_load (dest, rhs, TREE_TYPE (lhs), hbb);
+    }
+  else if (TREE_CODE (rhs) == SSA_NAME
+	   || (is_gimple_min_invariant (rhs) && TREE_CODE (rhs) != STRING_CST))
+    {
+      /* Store to memory.  */
+      hsa_op_base *src = hsa_reg_or_immed_for_gimple_op (rhs, hbb);
+      if (hsa_seen_error ())
+	return;
+
+      gen_hsa_insns_for_store (lhs, src, hbb);
+    }
+  else
+    {
+      hsa_op_address *addr_lhs = gen_hsa_addr (lhs, hbb);
+
+      if (TREE_CODE (rhs) == CONSTRUCTOR)
+	gen_hsa_ctor_assignment (addr_lhs, rhs, hbb);
+      else
+	{
+	  hsa_op_address *addr_rhs = gen_hsa_addr (rhs, hbb);
+
+	  unsigned size = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (rhs)));
+	  gen_hsa_memory_copy (hbb, addr_lhs, addr_rhs, size);
+	}
+    }
+}
+
+/* Prepend before INSN a load from spill symbol of SPILL_REG.  Return the
+   register into which we loaded.  If this required another register to convert
+   from a B1 type, return it in *PTMP2, otherwise store NULL into it.  We
+   assume we are out of SSA so the returned register does not have its
+   definition set.  */
+
+hsa_op_reg *
+hsa_spill_in (hsa_insn_basic *insn, hsa_op_reg *spill_reg, hsa_op_reg **ptmp2)
+{
+  hsa_symbol *spill_sym = spill_reg->m_spill_sym;
+  hsa_op_reg *reg = new hsa_op_reg (spill_sym->m_type);
+  hsa_op_address *addr = new hsa_op_address (spill_sym);
+
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, spill_sym->m_type,
+					reg, addr);
+  hsa_insert_insn_before (mem, insn);
+
+  *ptmp2 = NULL;
+  if (spill_reg->m_type == BRIG_TYPE_B1)
+    {
+      hsa_insn_basic *cvtinsn;
+      *ptmp2 = reg;
+      reg = new hsa_op_reg (spill_reg->m_type);
+
+      cvtinsn = new hsa_insn_cvt (reg, *ptmp2);
+      hsa_insert_insn_before (cvtinsn, insn);
+    }
+  return reg;
+}
+
+/* Append after INSN a store to spill symbol of SPILL_REG.  Return the register
+   from which we stored.  If this required another register to convert to a B1
+   type, return it in *PTMP2, otherwise store NULL into it.  We assume we are
+   out of SSA so the returned register does not have its use updated.  */
+
+hsa_op_reg *
+hsa_spill_out (hsa_insn_basic *insn, hsa_op_reg *spill_reg, hsa_op_reg **ptmp2)
+{
+  hsa_symbol *spill_sym = spill_reg->m_spill_sym;
+  hsa_op_reg *reg = new hsa_op_reg (spill_sym->m_type);
+  hsa_op_address *addr = new hsa_op_address (spill_sym);
+  hsa_op_reg *returnreg;
+
+  *ptmp2 = NULL;
+  returnreg = reg;
+  if (spill_reg->m_type == BRIG_TYPE_B1)
+    {
+      hsa_insn_basic *cvtinsn;
+      *ptmp2 = new hsa_op_reg (spill_sym->m_type);
+      reg->m_type = spill_reg->m_type;
+
+      cvtinsn = new hsa_insn_cvt (*ptmp2, returnreg);
+      hsa_append_insn_after (cvtinsn, insn);
+      insn = cvtinsn;
+      reg = *ptmp2;
+    }
+
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_ST, spill_sym->m_type, reg,
+					addr);
+  hsa_append_insn_after (mem, insn);
+  return returnreg;
+}
+
+/* Generate a comparison instruction that will compare LHS and RHS with
+   comparison specified by CODE and put result into register DEST.  DEST has to
+   have its type set already but must not have its definition set yet.
+   Generated instructions will be added to HBB.  */
+
+static void
+gen_hsa_cmp_insn_from_gimple (enum tree_code code, tree lhs, tree rhs,
+			      hsa_op_reg *dest, hsa_bb *hbb)
+{
+  BrigCompareOperation8_t compare;
+
+  switch (code)
+    {
+    case LT_EXPR:
+      compare = BRIG_COMPARE_LT;
+      break;
+    case LE_EXPR:
+      compare = BRIG_COMPARE_LE;
+      break;
+    case GT_EXPR:
+      compare = BRIG_COMPARE_GT;
+      break;
+    case GE_EXPR:
+      compare = BRIG_COMPARE_GE;
+      break;
+    case EQ_EXPR:
+      compare = BRIG_COMPARE_EQ;
+      break;
+    case NE_EXPR:
+      compare = BRIG_COMPARE_NE;
+      break;
+    case UNORDERED_EXPR:
+      compare = BRIG_COMPARE_NAN;
+      break;
+    case ORDERED_EXPR:
+      compare = BRIG_COMPARE_NUM;
+      break;
+    case UNLT_EXPR:
+      compare = BRIG_COMPARE_LTU;
+      break;
+    case UNLE_EXPR:
+      compare = BRIG_COMPARE_LEU;
+      break;
+    case UNGT_EXPR:
+      compare = BRIG_COMPARE_GTU;
+      break;
+    case UNGE_EXPR:
+      compare = BRIG_COMPARE_GEU;
+      break;
+    case UNEQ_EXPR:
+      compare = BRIG_COMPARE_EQU;
+      break;
+    case LTGT_EXPR:
+      compare = BRIG_COMPARE_NEU;
+      break;
+
+    default:
+      HSA_SORRY_ATV (EXPR_LOCATION (lhs),
+		     "support for HSA does not implement comparison tree "
+		     "code %s\n", get_tree_code_name (code));
+      return;
+    }
+
+  hsa_insn_cmp *cmp = new hsa_insn_cmp (compare, dest->m_type);
+  cmp->set_op (0, dest);
+  cmp->set_op (1, hsa_reg_or_immed_for_gimple_op (lhs, hbb));
+  cmp->set_op (2, hsa_reg_or_immed_for_gimple_op (rhs, hbb));
+  hbb->append_insn (cmp);
+}
+
+/* Generate a unary instruction with OPCODE and append it to a basic block
+   HBB.  The instruction uses DEST as its destination and OP1 as its single
+   operand.  */
+
+static void
+gen_hsa_unary_operation (int opcode, hsa_op_reg *dest,
+			 hsa_op_with_type *op1, hsa_bb *hbb)
+{
+  gcc_checking_assert (dest);
+  hsa_insn_basic *insn;
+
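+  /* A move between registers whose types differ enough to need a conversion
+     must be emitted as an explicit CVT instruction.  */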
+  if (opcode == BRIG_OPCODE_MOV && hsa_needs_cvt (dest->m_type, op1->m_type))
+    insn = new hsa_insn_cvt (dest, op1);
+  else
+    {
+      insn = new hsa_insn_basic (2, opcode, dest->m_type, dest, op1);
+
+      if (opcode == BRIG_OPCODE_ABS || opcode == BRIG_OPCODE_NEG)
+	{
+	  /* ABS and NEG only exist in _s form :-/  */
+	  if (insn->m_type == BRIG_TYPE_U32)
+	    insn->m_type = BRIG_TYPE_S32;
+	  else if (insn->m_type == BRIG_TYPE_U64)
+	    insn->m_type = BRIG_TYPE_S64;
+	}
+    }
+
+  hbb->append_insn (insn);
+}
+
+/* Generate a binary instruction with OPCODE and append it to a basic block
+   HBB.  The instruction uses DEST as a destination and operands OP1
+   and OP2.  */
+
+static void
+gen_hsa_binary_operation (int opcode, hsa_op_reg *dest,
+			  hsa_op_base *op1, hsa_op_base *op2, hsa_bb *hbb)
+{
+  gcc_checking_assert (dest);
+
+  if ((opcode == BRIG_OPCODE_SHL || opcode == BRIG_OPCODE_SHR)
+      && is_a <hsa_op_immed *> (op2))
+    {
+      hsa_op_immed *i = dyn_cast <hsa_op_immed *> (op2);
+      i->set_type (BRIG_TYPE_U32);
+    }
+
+  hsa_insn_basic *insn = new hsa_insn_basic (3, opcode, dest->m_type, dest,
+					     op1, op2);
+  hbb->append_insn (insn);
+}
+
+/* Generate HSA instructions for a single assignment.  HBB is the basic block
+   they will be appended to.  */
+
+static void
+gen_hsa_insns_for_operation_assignment (gimple *assign, hsa_bb *hbb)
+{
+  tree_code code = gimple_assign_rhs_code (assign);
+  gimple_rhs_class rhs_class = get_gimple_rhs_class (gimple_expr_code (assign));
+
+  tree lhs = gimple_assign_lhs (assign);
+  tree rhs1 = gimple_assign_rhs1 (assign);
+  tree rhs2 = gimple_assign_rhs2 (assign);
+  tree rhs3 = gimple_assign_rhs3 (assign);
+
+  int opcode;
+
+  switch (code)
+    {
+    CASE_CONVERT:
+    case FLOAT_EXPR:
+      /* The opcode is changed to BRIG_OPCODE_CVT if the BRIG types
+	 need a conversion.  */
+      opcode = BRIG_OPCODE_MOV;
+      break;
+
+    case PLUS_EXPR:
+    case POINTER_PLUS_EXPR:
+      opcode = BRIG_OPCODE_ADD;
+      break;
+    case MINUS_EXPR:
+      opcode = BRIG_OPCODE_SUB;
+      break;
+    case MULT_EXPR:
+      opcode = BRIG_OPCODE_MUL;
+      break;
+    case MULT_HIGHPART_EXPR:
+      opcode = BRIG_OPCODE_MULHI;
+      break;
+    case RDIV_EXPR:
+    case TRUNC_DIV_EXPR:
+    case EXACT_DIV_EXPR:
+      opcode = BRIG_OPCODE_DIV;
+      break;
+    case CEIL_DIV_EXPR:
+    case FLOOR_DIV_EXPR:
+    case ROUND_DIV_EXPR:
+      HSA_SORRY_AT (gimple_location (assign),
+		    "support for HSA does not implement CEIL_DIV_EXPR, "
+		    "FLOOR_DIV_EXPR or ROUND_DIV_EXPR");
+      return;
+    case TRUNC_MOD_EXPR:
+      opcode = BRIG_OPCODE_REM;
+      break;
+    case CEIL_MOD_EXPR:
+    case FLOOR_MOD_EXPR:
+    case ROUND_MOD_EXPR:
+      HSA_SORRY_AT (gimple_location (assign),
+		    "support for HSA does not implement CEIL_MOD_EXPR, "
+		    "FLOOR_MOD_EXPR or ROUND_MOD_EXPR");
+      return;
+    case NEGATE_EXPR:
+      opcode = BRIG_OPCODE_NEG;
+      break;
+    case MIN_EXPR:
+      opcode = BRIG_OPCODE_MIN;
+      break;
+    case MAX_EXPR:
+      opcode = BRIG_OPCODE_MAX;
+      break;
+    case ABS_EXPR:
+      opcode = BRIG_OPCODE_ABS;
+      break;
+    case LSHIFT_EXPR:
+      opcode = BRIG_OPCODE_SHL;
+      break;
+    case RSHIFT_EXPR:
+      opcode = BRIG_OPCODE_SHR;
+      break;
+    case LROTATE_EXPR:
+    case RROTATE_EXPR:
+      {
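+	/* HSAIL has no rotate instructions, so expand the rotation as
+	   x ROT s == (x SHL s) | (x SHR (bitsize - s)), with the two shift
+	   opcodes swapped for a right rotate.  */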
+	hsa_insn_basic *insn = NULL;
+	int code1 = code == LROTATE_EXPR ? BRIG_OPCODE_SHL : BRIG_OPCODE_SHR;
+	int code2 = code != LROTATE_EXPR ? BRIG_OPCODE_SHL : BRIG_OPCODE_SHR;
+	BrigType16_t btype = hsa_type_for_scalar_tree_type (TREE_TYPE (lhs),
+							    true);
+
+	hsa_op_with_type *src = hsa_reg_or_immed_for_gimple_op (rhs1, hbb);
+	hsa_op_reg *op1 = new hsa_op_reg (btype);
+	hsa_op_reg *op2 = new hsa_op_reg (btype);
+	hsa_op_with_type *shift1 = hsa_reg_or_immed_for_gimple_op (rhs2, hbb);
+
+	tree type = TREE_TYPE (rhs2);
+	unsigned HOST_WIDE_INT bitsize = TREE_INT_CST_LOW (TYPE_SIZE (type));
+
+	hsa_op_with_type *shift2 = NULL;
+	if (TREE_CODE (rhs2) == INTEGER_CST)
+	  shift2 = new hsa_op_immed (bitsize - tree_to_uhwi (rhs2),
+				     BRIG_TYPE_U32);
+	else if (TREE_CODE (rhs2) == SSA_NAME)
+	  {
+	    hsa_op_reg *s = hsa_cfun->reg_for_gimple_ssa (rhs2);
+	    hsa_op_reg *d = new hsa_op_reg (s->m_type);
+	    hsa_op_immed *size_imm = new hsa_op_immed (bitsize, BRIG_TYPE_U32);
+
+	    insn = new hsa_insn_basic (3, BRIG_OPCODE_SUB, d->m_type,
+				       d, s, size_imm);
+	    hbb->append_insn (insn);
+
+	    shift2 = d;
+	  }
+	else
+	  gcc_unreachable ();
+
+	hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+	gen_hsa_binary_operation (code1, op1, src, shift1, hbb);
+	gen_hsa_binary_operation (code2, op2, src, shift2, hbb);
+	gen_hsa_binary_operation (BRIG_OPCODE_OR, dest, op1, op2, hbb);
+
+	return;
+      }
+    case BIT_IOR_EXPR:
+      opcode = BRIG_OPCODE_OR;
+      break;
+    case BIT_XOR_EXPR:
+      opcode = BRIG_OPCODE_XOR;
+      break;
+    case BIT_AND_EXPR:
+      opcode = BRIG_OPCODE_AND;
+      break;
+    case BIT_NOT_EXPR:
+      opcode = BRIG_OPCODE_NOT;
+      break;
+    case FIX_TRUNC_EXPR:
+      {
+	hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+	hsa_op_with_type *v = hsa_reg_or_immed_for_gimple_op (rhs1, hbb);
+
+	if (hsa_needs_cvt (dest->m_type, v->m_type))
+	  {
+	    hsa_op_reg *tmp = new hsa_op_reg (v->m_type);
+
+	    hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_TRUNC,
+						       tmp->m_type, tmp, v);
+	    hbb->append_insn (insn);
+
+	    hsa_insn_basic *cvtinsn = new hsa_insn_cvt (dest, tmp);
+	    hbb->append_insn (cvtinsn);
+	  }
+	else
+	  {
+	    hsa_insn_basic *insn = new hsa_insn_basic (2, BRIG_OPCODE_TRUNC,
+						       dest->m_type, dest, v);
+	    hbb->append_insn (insn);
+	  }
+
+	return;
+      }
+
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+      {
+	hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa
+	  (gimple_assign_lhs (assign));
+
+	gen_hsa_cmp_insn_from_gimple (code, rhs1, rhs2, dest, hbb);
+	return;
+      }
+    case COND_EXPR:
+      {
+	hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa
+	  (gimple_assign_lhs (assign));
+	hsa_op_with_type *ctrl = NULL;
+	tree cond = rhs1;
+
+	if (CONSTANT_CLASS_P (cond) || TREE_CODE (cond) == SSA_NAME)
+	  ctrl = hsa_reg_or_immed_for_gimple_op (cond, hbb);
+	else
+	  {
+	    hsa_op_reg *r = new hsa_op_reg (BRIG_TYPE_B1);
+
+	    gen_hsa_cmp_insn_from_gimple (TREE_CODE (cond),
+				  TREE_OPERAND (cond, 0),
+				  TREE_OPERAND (cond, 1),
+				  r, hbb);
+
+	    ctrl = r;
+	  }
+
+	hsa_op_with_type *rhs2_reg = hsa_reg_or_immed_for_gimple_op
+	  (rhs2, hbb);
+	hsa_op_with_type *rhs3_reg = hsa_reg_or_immed_for_gimple_op
+	  (rhs3, hbb);
+
+	BrigType16_t btype = hsa_bittype_for_type (dest->m_type);
+	hsa_op_reg *tmp = new hsa_op_reg (btype);
+
+	rhs2_reg->m_type = btype;
+	rhs3_reg->m_type = btype;
+
+	hsa_insn_basic *insn = new hsa_insn_basic
+	  (4, BRIG_OPCODE_CMOV, tmp->m_type, tmp, ctrl, rhs2_reg, rhs3_reg);
+
+	hbb->append_insn (insn);
+
+	/* As operands of a CMOV insn must be Bx types, we have to emit
+	   a conversion insn.  */
+	hsa_insn_basic *mov = new hsa_insn_basic (2, BRIG_OPCODE_MOV,
+						  dest->m_type, dest, tmp);
+	hbb->append_insn (mov);
+
+	return;
+      }
+    case COMPLEX_EXPR:
+      {
+	hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa
+	  (gimple_assign_lhs (assign));
+	hsa_op_with_type *rhs1_reg = hsa_reg_or_immed_for_gimple_op (rhs1, hbb);
+	hsa_op_with_type *rhs2_reg = hsa_reg_or_immed_for_gimple_op (rhs2, hbb);
+
+	if (hsa_seen_error ())
+	  return;
+
+	BrigType16_t src_type = hsa_bittype_for_type (rhs1_reg->m_type);
+	rhs1_reg = rhs1_reg->get_in_type (src_type, hbb);
+	rhs2_reg = rhs2_reg->get_in_type (src_type, hbb);
+
+	hsa_insn_packed *insn = new hsa_insn_packed
+	  (3, BRIG_OPCODE_COMBINE, dest->m_type, src_type, dest,
+	   rhs1_reg, rhs2_reg);
+	hbb->append_insn (insn);
+
+	return;
+      }
+    default:
+      /* Implement others as we come across them.  */
+      HSA_SORRY_ATV (gimple_location (assign),
+		     "support for HSA does not implement operation %s",
+		     get_tree_code_name (code));
+      return;
+    }
+
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (gimple_assign_lhs (assign));
+
+  hsa_op_with_type *op1 = hsa_reg_or_immed_for_gimple_op (rhs1, hbb);
+  hsa_op_with_type *op2 = rhs2 != NULL_TREE ?
+    hsa_reg_or_immed_for_gimple_op (rhs2, hbb) : NULL;
+
+  if (hsa_seen_error ())
+    return;
+
+  switch (rhs_class)
+    {
+    case GIMPLE_TERNARY_RHS:
+      gcc_unreachable ();
+
+    case GIMPLE_BINARY_RHS:
+      gen_hsa_binary_operation (opcode, dest, op1, op2, hbb);
+      break;
+
+    case GIMPLE_UNARY_RHS:
+      gen_hsa_unary_operation (opcode, dest, op1, hbb);
+      break;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Generate HSA instructions for a given gimple condition statement COND.
+   Instructions will be appended to HBB, which also needs to be the
+   corresponding structure to the basic_block of COND.  */
+
+static void
+gen_hsa_insns_for_cond_stmt (gimple *cond, hsa_bb *hbb)
+{
+  hsa_op_reg *ctrl = new hsa_op_reg (BRIG_TYPE_B1);
+  hsa_insn_br *cbr;
+
+  gen_hsa_cmp_insn_from_gimple (gimple_cond_code (cond),
+				gimple_cond_lhs (cond),
+				gimple_cond_rhs (cond),
+				ctrl, hbb);
+
+  cbr = new hsa_insn_br (ctrl);
+  hbb->append_insn (cbr);
+}
+
+/* Maximum number of elements in a jump table for an HSA SBR instruction.  */
+
+#define HSA_MAXIMUM_SBR_LABELS	16
+
+/* Return lowest value of a switch S that is handled in a non-default
+   label.  */
+
+static tree
+get_switch_low (gswitch *s)
+{
+  unsigned labels = gimple_switch_num_labels (s);
+  gcc_checking_assert (labels > 1);
+
+  return CASE_LOW (gimple_switch_label (s, 1));
+}
+
+/* Return highest value of a switch S that is handled in a non-default
+   label.  */
+
+static tree
+get_switch_high (gswitch *s)
+{
+  unsigned labels = gimple_switch_num_labels (s);
+
+  /* Compare last label to maximum number of labels.  */
+  tree label = gimple_switch_label (s, labels - 1);
+  tree low = CASE_LOW (label);
+  tree high = CASE_HIGH (label);
+
+  return high != NULL_TREE ? high : low;
+}
+
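+/* Return the difference between the highest and lowest case value handled by
+   switch S, i.e. one less than the number of entries its jump table
+   needs.  */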
+static tree
+get_switch_size (gswitch *s)
+{
+  return int_const_binop (MINUS_EXPR, get_switch_high (s), get_switch_low (s));
+}
+
+/* Generate HSA instructions for a given gimple switch.
+   Instructions will be appended to HBB.  */
+
+static void
+gen_hsa_insns_for_switch_stmt (gswitch *s, hsa_bb *hbb)
+{
+  function *func = DECL_STRUCT_FUNCTION (current_function_decl);
+  tree index_tree = gimple_switch_index (s);
+  tree lowest = get_switch_low (s);
+
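+  /* Rebase the index to zero by subtracting the lowest case value, so that
+     it can directly index the jump table of the SBR instruction.  */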
+  hsa_op_reg *index = hsa_cfun->reg_for_gimple_ssa (index_tree);
+  hsa_op_reg *sub_index = new hsa_op_reg (index->m_type);
+  hbb->append_insn (new hsa_insn_basic (3, BRIG_OPCODE_SUB, sub_index->m_type,
+					sub_index, index,
+					new hsa_op_immed (lowest)));
+
+  hsa_op_base *tmp = sub_index->get_in_type (BRIG_TYPE_U64, hbb);
+  sub_index = as_a <hsa_op_reg *> (tmp);
+  unsigned labels = gimple_switch_num_labels (s);
+  unsigned HOST_WIDE_INT size = tree_to_uhwi (get_switch_size (s));
+
+  hsa_insn_sbr *sbr = new hsa_insn_sbr (sub_index, size + 1);
+  tree default_label = gimple_switch_default_label (s);
+  basic_block default_label_bb = label_to_block_fn
+    (func, CASE_LABEL (default_label));
+
+  sbr->m_default_bb = default_label_bb;
+
+  /* Initialize the jump table with the default label destination.  */
+  for (unsigned HOST_WIDE_INT i = 0; i <= size; i++)
+    sbr->m_jump_table.safe_push (default_label_bb);
+
+  /* Iterate over all labels and fill in the jump table.  */
+  for (unsigned i = 1; i < labels; i++)
+    {
+      tree label = gimple_switch_label (s, i);
+      basic_block bb = label_to_block_fn (func, CASE_LABEL (label));
+
+      unsigned HOST_WIDE_INT sub_low = tree_to_uhwi
+	(int_const_binop (MINUS_EXPR, CASE_LOW (label), lowest));
+
+      unsigned HOST_WIDE_INT sub_high = sub_low;
+      tree high = CASE_HIGH (label);
+      if (high != NULL)
+	sub_high = tree_to_uhwi (int_const_binop (MINUS_EXPR, high, lowest));
+
+      for (unsigned HOST_WIDE_INT j = sub_low; j <= sub_high; j++)
+	sbr->m_jump_table[j] = bb;
+    }
+
+  hbb->append_insn (sbr);
+}
+
+/* Verify that the function DECL can be handled by HSA.  */
+
+static void
+verify_function_arguments (tree decl)
+{
+  if (DECL_STATIC_CHAIN (decl))
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (decl),
+		     "HSA does not support nested functions: %D", decl);
+      return;
+    }
+  else if (!TYPE_ARG_TYPES (TREE_TYPE (decl)))
+    {
+      HSA_SORRY_ATV (EXPR_LOCATION (decl),
+		     "HSA does not support functions with variadic arguments "
+		     "(or unknown return type): %D", decl);
+      return;
+    }
+}
+
+/* Return BRIG type for FORMAL_ARG_TYPE.  If the formal argument type is NULL,
+   return ACTUAL_ARG_TYPE.  */
+
+static BrigType16_t
+get_format_argument_type (tree formal_arg_type, BrigType16_t actual_arg_type)
+{
+  if (formal_arg_type == NULL)
+    return actual_arg_type;
+
+  BrigType16_t decl_type = hsa_type_for_scalar_tree_type
+    (formal_arg_type, false);
+  return mem_type_for_type (decl_type);
+}
+
+/* Generate HSA instructions for a direct call instruction.
+   Instructions will be appended to HBB, which also needs to be the
+   corresponding structure to the basic_block of STMT.  */
+
+static void
+gen_hsa_insns_for_direct_call (gimple *stmt, hsa_bb *hbb)
+{
+  tree decl = gimple_call_fndecl (stmt);
+  verify_function_arguments (decl);
+  if (hsa_seen_error ())
+    return;
+
+  hsa_insn_call *call_insn = new hsa_insn_call (decl);
+  hsa_cfun->m_called_functions.safe_push (call_insn->m_called_function);
+
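+  /* HSAIL passes arguments through arg segment variables; accesses to them
+     and the call itself must all be enclosed in a single argument block.  */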
+  /* Argument block start.  */
+  hsa_insn_arg_block *arg_start = new hsa_insn_arg_block
+    (BRIG_KIND_DIRECTIVE_ARG_BLOCK_START, call_insn);
+  hbb->append_insn (arg_start);
+
+  tree parm_type_chain = TYPE_ARG_TYPES (gimple_call_fntype (stmt));
+
+  /* Prepare the arguments that will be passed to the function.  */
+  const unsigned args = gimple_call_num_args (stmt);
+  for (unsigned i = 0; i < args; ++i)
+    {
+      tree parm = gimple_call_arg (stmt, (int)i);
+      tree parm_decl_type = parm_type_chain != NULL_TREE
+	? TREE_VALUE (parm_type_chain) : NULL_TREE;
+      hsa_op_address *addr;
+
+      if (AGGREGATE_TYPE_P (TREE_TYPE (parm)))
+	{
+	  addr = gen_hsa_addr_for_arg (TREE_TYPE (parm), i);
+	  hsa_op_address *src = gen_hsa_addr (parm, hbb);
+	  gen_hsa_memory_copy (hbb, addr, src,
+			       addr->m_symbol->total_byte_size ());
+	}
+      else
+	{
+	  hsa_op_with_type *src = hsa_reg_or_immed_for_gimple_op (parm, hbb);
+
+	  if (parm_decl_type != NULL && AGGREGATE_TYPE_P (parm_decl_type))
+	    {
+	      HSA_SORRY_AT (gimple_location (stmt),
+			    "support for HSA does not implement an aggregate "
+			    "formal argument in a function call, while actual "
+			    "argument is not an aggregate");
+	      return;
+	    }
+
+	  BrigType16_t formal_arg_type = get_format_argument_type
+	    (parm_decl_type, src->m_type);
+	  if (hsa_seen_error ())
+	    return;
+
+	  if (src->m_type != formal_arg_type)
+	    src = src->get_in_type (formal_arg_type, hbb);
+
+	  addr = gen_hsa_addr_for_arg
+	    (parm_decl_type != NULL_TREE ? parm_decl_type: TREE_TYPE (parm), i);
+	  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_ST, formal_arg_type,
+						src, addr);
+
+	  hbb->append_insn (mem);
+	}
+
+      call_insn->m_input_args.safe_push (addr->m_symbol);
+      if (parm_type_chain)
+	parm_type_chain = TREE_CHAIN (parm_type_chain);
+    }
+
+  call_insn->m_args_code_list = new hsa_op_code_list (args);
+  hbb->append_insn (call_insn);
+
+  tree result_type = TREE_TYPE (TREE_TYPE (decl));
+
+  tree result = gimple_call_lhs (stmt);
+  hsa_insn_mem *result_insn = NULL;
+  if (!VOID_TYPE_P (result_type))
+    {
+      hsa_op_address *addr = gen_hsa_addr_for_arg (result_type, -1);
+
+      /* Even if the result of a function call is unused, we have to emit
+	 a declaration for the result.  */
+      if (result)
+	{
+	  tree lhs_type = TREE_TYPE (result);
+
+	  if (hsa_seen_error ())
+	    return;
+
+	  if (AGGREGATE_TYPE_P (lhs_type))
+	    {
+	      hsa_op_address *result_addr = gen_hsa_addr (result, hbb);
+	      gen_hsa_memory_copy (hbb, result_addr, addr,
+				   addr->m_symbol->total_byte_size ());
+	    }
+	  else
+	    {
+	      BrigType16_t mtype = mem_type_for_type
+		(hsa_type_for_scalar_tree_type (lhs_type, false));
+
+	      hsa_op_reg *dst = hsa_cfun->reg_for_gimple_ssa (result);
+	      result_insn = new hsa_insn_mem (BRIG_OPCODE_LD, mtype, dst, addr);
+	      hbb->append_insn (result_insn);
+	    }
+	}
+
+      call_insn->m_output_arg = addr->m_symbol;
+      call_insn->m_result_code_list = new hsa_op_code_list (1);
+    }
+  else
+    {
+      if (result)
+	{
+	  HSA_SORRY_AT (gimple_location (stmt),
+			"support for HSA does not implement an assignment of "
+			"return value from a void function");
+	  return;
+	}
+
+      call_insn->m_result_code_list = new hsa_op_code_list (0);
+    }
+
+  /* Argument block end.  */
+  hsa_insn_arg_block *arg_end = new hsa_insn_arg_block
+    (BRIG_KIND_DIRECTIVE_ARG_BLOCK_END, call_insn);
+  hbb->append_insn (arg_end);
+}
+
+/* Generate HSA instructions for a return value instruction.
+   Instructions will be appended to HBB, which also needs to be the
+   corresponding structure to the basic_block of STMT.  */
+
+static void
+gen_hsa_insns_for_return (greturn *stmt, hsa_bb *hbb)
+{
+  tree retval = gimple_return_retval (stmt);
+  if (retval)
+    {
+      hsa_op_address *addr = new hsa_op_address (hsa_cfun->m_output_arg);
+
+      if (AGGREGATE_TYPE_P (TREE_TYPE (retval)))
+	{
+	  hsa_op_address *retval_addr = gen_hsa_addr (retval, hbb);
+	  gen_hsa_memory_copy (hbb, addr, retval_addr,
+			       hsa_cfun->m_output_arg->total_byte_size ());
+	}
+      else
+	{
+	  BrigType16_t mtype = mem_type_for_type
+	    (hsa_type_for_scalar_tree_type (TREE_TYPE (retval), false));
+
+	  /* Store of return value.  */
+	  hsa_op_with_type *src = hsa_reg_or_immed_for_gimple_op (retval, hbb);
+	  src = src->get_in_type (mtype, hbb);
+	  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_ST, mtype, src,
+						addr);
+	  hbb->append_insn (mem);
+	}
+    }
+
+  /* HSAIL return instruction emission.  */
+  hsa_insn_basic *ret = new hsa_insn_basic (0, BRIG_OPCODE_RET);
+  hbb->append_insn (ret);
+}
+
+/* Set the OP_INDEX-th operand of the instruction to DEST.  Because DEST may
+   have a different type, conversion instructions may be appended to HBB.  */
+
+void
+hsa_insn_basic::set_output_in_type (hsa_op_reg *dest, unsigned op_index,
+				    hsa_bb *hbb)
+{
+  hsa_insn_basic *insn;
+  gcc_checking_assert (op_output_p (op_index));
+
+  if (dest->m_type == m_type)
+    {
+      set_op (op_index, dest);
+      return;
+    }
+
+  hsa_op_reg *tmp = new hsa_op_reg (m_type);
+  set_op (op_index, tmp);
+
+  if (hsa_needs_cvt (dest->m_type, m_type))
+    insn = new hsa_insn_cvt (dest, tmp);
+  else
+    insn = new hsa_insn_basic (2, BRIG_OPCODE_MOV, dest->m_type,
+			       dest, tmp->get_in_type (dest->m_type, hbb));
+
+  hbb->append_insn (insn);
+}
+
+/* Generate instruction OPCODE to query a property of HSA grid along the
+   given DIMENSION.  Store result into DEST and append the instruction to
+   HBB.  */
+
+static void
+query_hsa_grid (hsa_op_reg *dest, BrigOpcode16_t opcode, int dimension,
+		hsa_bb *hbb)
+{
+  /* The grid dimension to query is encoded as a U32 immediate operand of
+     the instruction.  */
+  hsa_op_immed *imm = new hsa_op_immed (dimension,
+					(BrigKind16_t) BRIG_TYPE_U32);
+  hsa_insn_basic *insn = new hsa_insn_basic (2, opcode, BRIG_TYPE_U32, NULL,
+					     imm);
+  hbb->append_insn (insn);
+  insn->set_output_in_type (dest, 0, hbb);
+}
+
+/* Generate instruction OPCODE to query a property of the HSA grid along the
+   given DIMENSION and store the result into the lhs of call statement STMT.
+   Nothing is emitted when the lhs is unused.  Instructions are appended to
+   basic block HBB.  */
+
+static void
+query_hsa_grid (gimple *stmt, BrigOpcode16_t opcode, int dimension,
+		hsa_bb *hbb)
+{
+  tree lhs = gimple_call_lhs (dyn_cast <gcall *> (stmt));
+  if (lhs == NULL_TREE)
+    return;
+
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+
+  query_hsa_grid (dest, opcode, dimension, hbb);
+}
+
+/* Emit instructions that set hsa_num_threads according to provided VALUE.
+   Instructions are appended to basic block HBB.  */
+
+static void
+gen_set_num_threads (tree value, hsa_bb *hbb)
+{
+  hbb->append_insn (new hsa_insn_comment ("omp_set_num_threads"));
+  hsa_op_with_type *src = hsa_reg_or_immed_for_gimple_op (value, hbb);
+
+  src = src->get_in_type (hsa_num_threads->m_type, hbb);
+  hsa_op_address *addr = new hsa_op_address (hsa_num_threads);
+
+  hsa_insn_basic *basic = new hsa_insn_mem
+    (BRIG_OPCODE_ST, hsa_num_threads->m_type, src, addr);
+  hbb->append_insn (basic);
+}
+
+/* Return an HSA register that will contain the number of threads for
+   a future dispatched kernel.  Instructions are added to HBB.  */
+
+static hsa_op_reg *
+gen_num_threads_for_dispatch (hsa_bb *hbb)
+{
+  /* Step 1) Assign to number of threads:
+     MIN (HSA_DEFAULT_NUM_THREADS, hsa_num_threads).  */
+  hsa_op_reg *threads = new hsa_op_reg (hsa_num_threads->m_type);
+  hsa_op_address *addr = new hsa_op_address (hsa_num_threads);
+
+  hbb->append_insn (new hsa_insn_mem (BRIG_OPCODE_LD, threads->m_type,
+				      threads, addr));
+
+  hsa_op_immed *limit = new hsa_op_immed (HSA_DEFAULT_NUM_THREADS,
+					  BRIG_TYPE_U32);
+  hsa_op_reg *r = new hsa_op_reg (BRIG_TYPE_B1);
+  hbb->append_insn
+    (new hsa_insn_cmp (BRIG_COMPARE_LT, r->m_type, r, threads, limit));
+
+  BrigType16_t btype = hsa_bittype_for_type (threads->m_type);
+  hsa_op_reg *tmp = new hsa_op_reg (threads->m_type);
+
+  hbb->append_insn
+    (new hsa_insn_basic (4, BRIG_OPCODE_CMOV, btype, tmp, r,
+			 threads, limit));
+
+  /* Step 2) If the number is equal to zero,
+     return shadow->omp_num_threads.  */
+  hsa_op_reg *shadow_reg_ptr = hsa_cfun->get_shadow_reg ();
+
+  hsa_op_reg *shadow_thread_count = new hsa_op_reg (BRIG_TYPE_U32);
+  addr = new hsa_op_address
+    (shadow_reg_ptr, offsetof (hsa_kernel_dispatch, omp_num_threads));
+  hsa_insn_basic *basic = new hsa_insn_mem
+    (BRIG_OPCODE_LD, shadow_thread_count->m_type, shadow_thread_count, addr);
+  hbb->append_insn (basic);
+
+  hsa_op_reg *tmp2 = new hsa_op_reg (threads->m_type);
+  r = new hsa_op_reg (BRIG_TYPE_B1);
+  hbb->append_insn
+    (new hsa_insn_cmp (BRIG_COMPARE_EQ, r->m_type, r, tmp,
+		       new hsa_op_immed (0, shadow_thread_count->m_type)));
+  hbb->append_insn
+    (new hsa_insn_basic (4, BRIG_OPCODE_CMOV, btype, tmp2, r,
+			 shadow_thread_count, tmp));
+
+  hsa_op_base *dest = tmp2->get_in_type (BRIG_TYPE_U16, hbb);
+
+  return as_a <hsa_op_reg *> (dest);
+}
+
+/* Emit instructions that assign the number of teams to the lhs of gimple
+   statement STMT.  Instructions are appended to basic block HBB.  */
+
+static void
+gen_get_num_teams (gimple *stmt, hsa_bb *hbb)
+{
+  if (gimple_call_lhs (stmt) == NULL_TREE)
+    return;
+
+  hbb->append_insn
+    (new hsa_insn_comment ("__builtin_omp_get_num_teams"));
+
+  tree lhs = gimple_call_lhs (stmt);
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+  hsa_op_immed *one = new hsa_op_immed (1, dest->m_type);
+
+  hsa_insn_basic *basic = new hsa_insn_basic
+    (2, BRIG_OPCODE_MOV, dest->m_type, dest, one);
+
+  hbb->append_insn (basic);
+}
+
+/* Emit instructions that assign the team number to the lhs of gimple
+   statement STMT.  Instructions are appended to basic block HBB.  */
+
+static void
+gen_get_team_num (gimple *stmt, hsa_bb *hbb)
+{
+  if (gimple_call_lhs (stmt) == NULL_TREE)
+    return;
+
+  hbb->append_insn
+    (new hsa_insn_comment ("__builtin_omp_get_team_num"));
+
+  tree lhs = gimple_call_lhs (stmt);
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+  hsa_op_immed *zero = new hsa_op_immed (0, dest->m_type);
+
+  hsa_insn_basic *basic = new hsa_insn_basic
+    (2, BRIG_OPCODE_MOV, dest->m_type, dest, zero);
+
+  hbb->append_insn (basic);
+}
+
+/* Store VALUE to the debug field of the kernel dispatch shadow structure.
+   Instructions are appended to basic block HBB.  */
+
+static void
+set_debug_value (hsa_bb *hbb, hsa_op_with_type *value)
+{
+  hsa_op_reg *shadow_reg_ptr = hsa_cfun->get_shadow_reg ();
+  if (shadow_reg_ptr == NULL)
+    return;
+
+  hsa_op_address *addr = new hsa_op_address
+    (shadow_reg_ptr, offsetof (hsa_kernel_dispatch, debug));
+  hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, value,
+					addr);
+  hbb->append_insn (mem);
+}
+
+/* If STMT is a call of a known library function, generate code to perform
+   it and return true.  */
+
+static bool
+gen_hsa_insns_for_known_library_call (gimple *stmt, hsa_bb *hbb)
+{
+  const char *name = hsa_get_declaration_name (gimple_call_fndecl (stmt));
+
+  if (strcmp (name, "omp_is_initial_device") == 0)
+    {
+      tree lhs = gimple_call_lhs (stmt);
+      if (!lhs)
+	return true;
+
+      hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+      hsa_op_immed *imm = new hsa_op_immed (build_zero_cst (TREE_TYPE (lhs)));
+
+      hsa_build_append_simple_mov (dest, imm, hbb);
+    }
+  else if (strcmp (name, "omp_set_num_threads") == 0)
+    gen_set_num_threads (gimple_call_arg (stmt, 0), hbb);
+  else if (strcmp (name, "omp_get_num_threads") == 0)
+    query_hsa_grid (stmt, BRIG_OPCODE_GRIDSIZE, 0, hbb);
+  else if (strcmp (name, "omp_get_num_teams") == 0)
+    gen_get_num_teams (stmt, hbb);
+  else if (strcmp (name, "omp_get_team_num") == 0)
+    gen_get_team_num (stmt, hbb);
+  else if (strcmp (name, "hsa_set_debug_value") == 0)
+    {
+      if (hsa_cfun->has_shadow_reg_p ())
+	{
+	  tree rhs1 = gimple_call_arg (stmt, 0);
+	  hsa_op_with_type *src = hsa_reg_or_immed_for_gimple_op (rhs1, hbb);
+
+	  src = src->get_in_type (BRIG_TYPE_U64, hbb);
+	  set_debug_value (hbb, src);
+	}
+    }
+  else
+    return false;
+
+  return true;
+}
+
+/* Generate HSA instructions for the given kernel call statement CALL.
+   Instructions will be appended to HBB.  */
+
+static void
+gen_hsa_insns_for_kernel_call (hsa_bb *hbb, gcall *call)
+{
+  /* TODO: all emitted instructions assume that we run on a LARGE_MODEL
+     agent.  */
+
+  hsa_insn_mem *mem;
+  hsa_op_address *addr;
+  hsa_op_immed *c;
+
+  hsa_op_reg *shadow_reg_ptr = hsa_cfun->get_shadow_reg ();
+  if (shadow_reg_ptr == NULL)
+    {
+      HSA_SORRY_AT (gimple_location (call),
+		    "support for HSA does not implement kernel dispatch from "
+		    "a function that is not an HSA kernel");
+      return;
+    }
+
+  /* Get the dispatch structure prepared for this particular kernel call.  */
+  hbb->append_insn (new hsa_insn_comment ("get kernel dispatch structure"));
+  addr = new hsa_op_address
+    (shadow_reg_ptr, offsetof (hsa_kernel_dispatch, children_dispatches));
+
+  hsa_op_reg *shadow_reg_base_ptr = new hsa_op_reg (BRIG_TYPE_U64);
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, shadow_reg_base_ptr,
+			  addr);
+  hbb->append_insn (mem);
+
+  unsigned index = hsa_cfun->m_kernel_dispatch_count;
+  unsigned byte_offset = index * sizeof (hsa_kernel_dispatch *);
+
+  addr = new hsa_op_address (shadow_reg_base_ptr, byte_offset);
+
+  hsa_op_reg *shadow_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, shadow_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Load the address of the command queue into a register.  */
+  hbb->append_insn (new hsa_insn_comment
+		    ("load base address of command queue"));
+
+  hsa_op_reg *queue_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  addr = new hsa_op_address (shadow_reg, offsetof (hsa_kernel_dispatch, queue));
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, queue_reg, addr);
+
+  hbb->append_insn (mem);
+
+  /* Load the address of the memory prepared for kernel arguments.  */
+  hbb->append_insn (new hsa_insn_comment ("get a kernarg address"));
+  hsa_op_reg *kernarg_reg = new hsa_op_reg (BRIG_TYPE_U64);
+
+  addr = new hsa_op_address (shadow_reg,
+			     offsetof (hsa_kernel_dispatch, kernarg_address));
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, kernarg_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Load the kernel object we want to call.  */
+  hbb->append_insn (new hsa_insn_comment ("get a kernel object"));
+  hsa_op_reg *object_reg = new hsa_op_reg (BRIG_TYPE_U64);
+
+  addr = new hsa_op_address (shadow_reg,
+			     offsetof (hsa_kernel_dispatch, object));
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, object_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Get signal prepared for the kernel dispatch.  */
+  hbb->append_insn (new hsa_insn_comment ("get a signal by kernel call index"));
+
+  hsa_op_reg *signal_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  addr = new hsa_op_address (shadow_reg,
+			     offsetof (hsa_kernel_dispatch, signal));
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, signal_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Store to synchronization signal.  */
+  hbb->append_insn (new hsa_insn_comment ("store 1 to signal handle"));
+
+  c = new hsa_op_immed (1, BRIG_TYPE_U64);
+
+  hsa_insn_signal *signal = new hsa_insn_signal (2, BRIG_OPCODE_SIGNALNORET,
+						 BRIG_ATOMIC_ST, BRIG_TYPE_B64,
+						 signal_reg, c);
+  signal->m_memoryorder = BRIG_MEMORY_ORDER_RELAXED;
+  signal->m_memoryscope = BRIG_MEMORY_SCOPE_SYSTEM;
+  hbb->append_insn (signal);
+
+  /* Get private segment size.  */
+  hsa_op_reg *private_seg_reg = new hsa_op_reg (BRIG_TYPE_U32);
+
+  hbb->append_insn (new hsa_insn_comment
+		    ("get a kernel private segment size by kernel call index"));
+
+  addr = new hsa_op_address
+    (shadow_reg, offsetof (hsa_kernel_dispatch, private_segment_size));
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U32, private_seg_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Get group segment size.  */
+  hsa_op_reg *group_seg_reg = new hsa_op_reg (BRIG_TYPE_U32);
+
+  hbb->append_insn (new hsa_insn_comment
+		    ("get a kernel group segment size by kernel call index"));
+
+  addr = new hsa_op_address
+    (shadow_reg, offsetof (hsa_kernel_dispatch, group_segment_size));
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U32, group_seg_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Get a write index to the command queue.  */
+  hsa_op_reg *queue_index_reg = new hsa_op_reg (BRIG_TYPE_U64);
+
+  c = new hsa_op_immed (1, BRIG_TYPE_U64);
+  hsa_insn_queue *queue = new hsa_insn_queue (3,
+					      BRIG_OPCODE_ADDQUEUEWRITEINDEX);
+
+  addr = new hsa_op_address (queue_reg);
+  queue->set_op (0, queue_index_reg);
+  queue->set_op (1, addr);
+  queue->set_op (2, c);
+
+  hbb->append_insn (queue);
+
+  /* Get packet base address.  */
+  size_t addr_offset = offsetof (hsa_queue, base_address);
+
+  hsa_op_reg *queue_addr_reg = new hsa_op_reg (BRIG_TYPE_U64);
+
+  c = new hsa_op_immed (addr_offset, BRIG_TYPE_U64);
+  hsa_insn_basic *insn = new hsa_insn_basic
+    (3, BRIG_OPCODE_ADD, BRIG_TYPE_U64, queue_addr_reg, queue_reg, c);
+
+  hbb->append_insn (insn);
+
+  hbb->append_insn (new hsa_insn_comment
+		    ("get base address of prepared packet"));
+
+  hsa_op_reg *queue_addr_value_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  addr = new hsa_op_address (queue_addr_reg);
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, queue_addr_value_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  c = new hsa_op_immed (sizeof (hsa_queue_packet), BRIG_TYPE_U64);
+  hsa_op_reg *queue_packet_offset_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  insn = new hsa_insn_basic
+    (3, BRIG_OPCODE_MUL, BRIG_TYPE_U64, queue_packet_offset_reg,
+     queue_index_reg, c);
+
+  hbb->append_insn (insn);
+
+  hsa_op_reg *queue_packet_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  insn = new hsa_insn_basic
+    (3, BRIG_OPCODE_ADD, BRIG_TYPE_U64, queue_packet_reg, queue_addr_value_reg,
+     queue_packet_offset_reg);
+
+  hbb->append_insn (insn);
+
+  /* Write to packet->setup.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->setup |= 1"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, setup));
+  hsa_op_reg *packet_setup_reg = new hsa_op_reg (BRIG_TYPE_U16);
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U16, packet_setup_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  hsa_op_with_type *packet_setup_u32 = packet_setup_reg->get_in_type
+    (BRIG_TYPE_U32, hbb);
+
+  hsa_op_reg *packet_setup_u32_2 = new hsa_op_reg (BRIG_TYPE_U32);
+  c = new hsa_op_immed (1, BRIG_TYPE_U32);
+  insn = new hsa_insn_basic (3, BRIG_OPCODE_OR, BRIG_TYPE_U32,
+			     packet_setup_u32_2, packet_setup_u32, c);
+
+  hbb->append_insn (insn);
+
+  hsa_op_with_type *packet_setup_reg_2 = packet_setup_u32_2->get_in_type
+    (BRIG_TYPE_U16, hbb);
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, setup));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, packet_setup_reg_2,
+			  addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->grid_size_x.  The number of threads is the value set
+     by omp_set_num_threads, or the one inherited from the parent dispatch
+     when it has not been changed.  */
+  hsa_op_reg *threads_reg = gen_num_threads_for_dispatch (hbb);
+
+  hbb->append_insn (new hsa_insn_comment
+		    ("set packet->grid_size_x = hsa_num_threads"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, grid_size_x));
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, threads_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Write to shadow_reg->omp_num_threads = hsa_num_threads.  */
+  hbb->append_insn (new hsa_insn_comment
+		    ("set shadow_reg->omp_num_threads = hsa_num_threads"));
+
+  addr = new hsa_op_address (shadow_reg, offsetof (hsa_kernel_dispatch,
+						   omp_num_threads));
+  hbb->append_insn
+    (new hsa_insn_mem (BRIG_OPCODE_ST, threads_reg->m_type, threads_reg, addr));
+
+  /* Write to packet->workgroup_size_x.  */
+  hbb->append_insn (new hsa_insn_comment
+		    ("set packet->workgroup_size_x = hsa_num_threads"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, workgroup_size_x));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, threads_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->grid_size_y.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->grid_size_y = 1"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, grid_size_y));
+  c = new hsa_op_immed (1, BRIG_TYPE_U16);
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, c, addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->workgroup_size_y.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->workgroup_size_y = 1"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, workgroup_size_y));
+  c = new hsa_op_immed (1, BRIG_TYPE_U16);
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, c, addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->grid_size_z.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->grid_size_z = 1"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, grid_size_z));
+  c = new hsa_op_immed (1, BRIG_TYPE_U16);
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, c, addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->workgroup_size_z.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->workgroup_size_z = 1"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, workgroup_size_z));
+  c = new hsa_op_immed (1, BRIG_TYPE_U16);
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, c, addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->private_segment_size.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->private_segment_size"));
+
+  hsa_op_with_type *private_seg_reg_u16 = private_seg_reg->get_in_type
+    (BRIG_TYPE_U16, hbb);
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, private_segment_size));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, private_seg_reg_u16,
+			  addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->group_segment_size.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->group_segment_size"));
+
+  hsa_op_with_type *group_seg_reg_u16 = group_seg_reg->get_in_type
+    (BRIG_TYPE_U16, hbb);
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, group_segment_size));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U16, group_seg_reg_u16,
+			  addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->kernel_object.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->kernel_object"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, kernel_object));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, object_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Copy the locally allocated memory holding the arguments into the
+     prepared buffer.  */
+  hbb->append_insn (new hsa_insn_comment ("get address of omp data memory"));
+
+  hsa_op_reg *omp_data_memory_reg = new hsa_op_reg (BRIG_TYPE_U64);
+
+  addr = new hsa_op_address (shadow_reg,
+			     offsetof (hsa_kernel_dispatch, omp_data_memory));
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, omp_data_memory_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  hsa_op_address *dst_addr = new hsa_op_address (omp_data_memory_reg);
+
+  tree argument = gimple_call_arg (call, 1);
+
+  if (TREE_CODE (argument) == ADDR_EXPR)
+    {
+      /* Emit instructions that copy OMP arguments.  */
+
+      tree d = TREE_TYPE (TREE_OPERAND (argument, 0));
+      unsigned omp_data_size = tree_to_uhwi (TYPE_SIZE_UNIT (d));
+      gcc_checking_assert (omp_data_size > 0);
+
+      if (omp_data_size > hsa_cfun->m_maximum_omp_data_size)
+	hsa_cfun->m_maximum_omp_data_size = omp_data_size;
+
+      hsa_symbol *var_decl = get_symbol_for_decl (TREE_OPERAND (argument, 0));
+
+      hbb->append_insn (new hsa_insn_comment ("memory copy instructions"));
+
+      hsa_op_address *src_addr = new hsa_op_address (var_decl);
+      gen_hsa_memory_copy (hbb, dst_addr, src_addr, var_decl->m_dim);
+    }
+  else if (integer_zerop (argument))
+    {
+      /* If NULL argument is passed, do nothing.  */
+    }
+  else
+    gcc_unreachable ();
+
+  hbb->append_insn (new hsa_insn_comment
+		    ("write memory pointer to packet->kernarg_address"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, kernarg_address));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, kernarg_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Write the OMP data block address as the first kernel argument.  */
+  hbb->append_insn (new hsa_insn_comment
+		    ("write argument0 to *packet->kernarg_address"));
+
+  addr = new hsa_op_address (kernarg_reg);
+
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, omp_data_memory_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  /* Pass the shadow structure as the second argument to the dispatched
+     kernel.  */
+  hbb->append_insn (new hsa_insn_comment
+		    ("write argument1 to *packet->kernarg_address"));
+
+  addr = new hsa_op_address (kernarg_reg, 8);
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, shadow_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Write to packet->completion_signal.  */
+  hbb->append_insn (new hsa_insn_comment ("set packet->completion_signal"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, completion_signal));
+  mem = new hsa_insn_mem (BRIG_OPCODE_ST, BRIG_TYPE_U64, signal_reg, addr);
+  hbb->append_insn (mem);
+
+  /* Atomically write to packet->header.  */
+  hbb->append_insn
+    (new hsa_insn_comment ("store atomically to packet->header"));
+
+  addr = new hsa_op_address (queue_packet_reg,
+			     offsetof (hsa_queue_packet, header));
+
+  /* Store (1 << 16) + 5122 to packet->header.  */
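+  /* In that value, the low 16 bits form the packet header proper: packet
+     type KERNEL_DISPATCH with system-scope acquire and release fences; the
+     high 16 bits form the setup field, requesting one grid dimension.  */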
+  c = new hsa_op_immed (70658, BRIG_TYPE_U32);
+
+  hsa_insn_atomic *atomic = new hsa_insn_atomic
+    (2, BRIG_OPCODE_ATOMICNORET, BRIG_ATOMIC_ST, BRIG_TYPE_B32,
+     BRIG_MEMORY_ORDER_SC_RELEASE, addr, c);
+  atomic->m_memoryscope = BRIG_MEMORY_SCOPE_SYSTEM;
+
+  hbb->append_insn (atomic);
+
+  /* Ring doorbell signal.  */
+  hbb->append_insn (new hsa_insn_comment ("store index to doorbell signal"));
+
+  hsa_op_reg *doorbell_signal_reg = new hsa_op_reg (BRIG_TYPE_U64);
+  addr = new hsa_op_address (queue_reg, offsetof (hsa_queue, doorbell_signal));
+  mem = new hsa_insn_mem (BRIG_OPCODE_LD, BRIG_TYPE_U64, doorbell_signal_reg,
+			  addr);
+  hbb->append_insn (mem);
+
+  signal = new hsa_insn_signal (2, BRIG_OPCODE_SIGNALNORET, BRIG_ATOMIC_ST,
+				BRIG_TYPE_B64, doorbell_signal_reg,
+				queue_index_reg);
+  signal->m_memoryorder = BRIG_MEMORY_ORDER_SC_RELEASE;
+  signal->m_memoryscope = BRIG_MEMORY_SCOPE_SYSTEM;
+  hbb->append_insn (signal);
+
+  /* Prepare CFG for waiting loop.  */
+  edge e = split_block (hbb->m_bb, call);
+
+  basic_block dest = split_edge (e);
+  edge false_e = EDGE_SUCC (dest, 0);
+
+  false_e->flags &= ~EDGE_FALLTHRU;
+  false_e->flags |= EDGE_FALSE_VALUE;
+
+  make_edge (e->dest, dest, EDGE_TRUE_VALUE);
+
+  /* Emit blocking signal waiting instruction.  */
+  hsa_bb *new_hbb = hsa_init_new_bb (dest);
+
+  hbb->append_insn (new hsa_insn_comment ("wait for the signal"));
+
+  hsa_op_reg *signal_result_reg = new hsa_op_reg (BRIG_TYPE_S64);
+  c = new hsa_op_immed (1, BRIG_TYPE_S64);
+
+  signal = new hsa_insn_signal (3, BRIG_OPCODE_SIGNAL,
+				BRIG_ATOMIC_WAIT_LT, BRIG_TYPE_S64);
+  signal->m_memoryorder = BRIG_MEMORY_ORDER_SC_ACQUIRE;
+  signal->m_memoryscope = BRIG_MEMORY_SCOPE_SYSTEM;
+  signal->set_op (0, signal_result_reg);
+  signal->set_op (1, signal_reg);
+  signal->set_op (2, c);
+  new_hbb->append_insn (signal);
+
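+  /* The wait instruction may return before the signal value actually drops
+     below one, so compare the returned value and branch back to the waiting
+     block if it is still one.  */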
+  hsa_op_reg *ctrl = new hsa_op_reg (BRIG_TYPE_B1);
+  hsa_insn_cmp *cmp = new hsa_insn_cmp
+    (BRIG_COMPARE_EQ, ctrl->m_type, ctrl, signal_result_reg,
+     new hsa_op_immed (1, signal_result_reg->m_type));
+
+  new_hbb->append_insn (cmp);
+  new_hbb->append_insn (new hsa_insn_br (ctrl));
+
+  hsa_cfun->m_kernel_dispatch_count++;
+}
+
+/* Helper function to create a single unary HSA operation out of a call to a
+   builtin.  OPCODE is the HSA operation to be generated.  STMT is a gimple
+   call to a builtin.  HBB is the HSA BB to which the instruction should be
+   added.  Note that nothing will be created if STMT does not have a LHS.  */
+
+static void
+gen_hsa_unaryop_for_builtin (int opcode, gimple *stmt, hsa_bb *hbb)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+  hsa_op_with_type *op = hsa_reg_or_immed_for_gimple_op
+    (gimple_call_arg (stmt, 0), hbb);
+  gen_hsa_unary_operation (opcode, dest, op, hbb);
+}
+
+/* Generate HSA address corresponding to a value VAL (as opposed to a memory
+   reference tree), for example an SSA_NAME or an ADDR_EXPR.  HBB is the HSA BB
+   to which the instruction should be added.  */
+
+static hsa_op_address *
+get_address_from_value (tree val, hsa_bb *hbb)
+{
+  switch (TREE_CODE (val))
+    {
+    case SSA_NAME:
+      {
+	BrigType16_t addrtype = hsa_get_segment_addr_type (BRIG_SEGMENT_FLAT);
+	hsa_op_base *reg = hsa_cfun->reg_for_gimple_ssa (val)->get_in_type
+	  (addrtype, hbb);
+	return new hsa_op_address (NULL, as_a <hsa_op_reg *> (reg), 0);
+      }
+    case ADDR_EXPR:
+      return gen_hsa_addr (TREE_OPERAND (val, 0), hbb);
+
+    case INTEGER_CST:
+      if (tree_fits_shwi_p (val))
+	return new hsa_op_address (NULL, NULL, tree_to_shwi (val));
+      /* Otherwise fall-through */
+
+    default:
+      HSA_SORRY_ATV (EXPR_LOCATION (val),
+		     "support for HSA does not implement memory access to %E",
+		     val);
+      return new hsa_op_address (NULL, NULL, 0);
+    }
+}
+
+/* Return string for MEMMODEL.  */
+
+static const char *
+get_memory_order_name (unsigned memmodel)
+{
+  switch (memmodel)
+    {
+    case __ATOMIC_RELAXED:
+      return "__ATOMIC_RELAXED";
+    case __ATOMIC_CONSUME:
+      return "__ATOMIC_CONSUME";
+    case __ATOMIC_ACQUIRE:
+      return "__ATOMIC_ACQUIRE";
+    case __ATOMIC_RELEASE:
+      return "__ATOMIC_RELEASE";
+    case __ATOMIC_ACQ_REL:
+      return "__ATOMIC_ACQ_REL";
+    case __ATOMIC_SEQ_CST:
+      return "__ATOMIC_SEQ_CST";
+    default:
+      return NULL;
+    }
+}
+
+/* Return memory order according to predefined __atomic memory model
+   constants.  LOCATION is provided to locate the problematic statement.  */
+
+static BrigMemoryOrder
+get_memory_order (unsigned memmodel, location_t location)
+{
+  switch (memmodel)
+    {
+    case __ATOMIC_RELAXED:
+      return BRIG_MEMORY_ORDER_RELAXED;
+    case __ATOMIC_ACQUIRE:
+      return BRIG_MEMORY_ORDER_SC_ACQUIRE;
+    case __ATOMIC_RELEASE:
+      return BRIG_MEMORY_ORDER_SC_RELEASE;
+    case __ATOMIC_ACQ_REL:
+      return BRIG_MEMORY_ORDER_SC_ACQUIRE_RELEASE;
+    default:
+      HSA_SORRY_ATV (location,
+		     "support for HSA does not implement memory model: %s",
+		     get_memory_order_name (memmodel));
+      return BRIG_MEMORY_ORDER_NONE;
+    }
+}
+
+/* Helper function to create an HSA atomic binary operation instruction out of
+   calls to atomic builtins.  RET_ORIG is true if the built-in is the variant
+   that returns the value before applying the operation, and false if it
+   should return the value after applying the operation (if it returns a
+   value at all).
+   ACODE is the atomic operation code, STMT is a gimple call to a builtin.  HBB
+   is the HSA BB to which the instruction should be added.  */
+
+static void
+gen_hsa_ternary_atomic_for_builtin (bool ret_orig,
+				    enum BrigAtomicOperation acode,
+				    gimple *stmt, hsa_bb *hbb)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  tree type = TREE_TYPE (gimple_call_arg (stmt, 1));
+  BrigType16_t hsa_type = hsa_type_for_scalar_tree_type (type, false);
+  BrigType16_t mtype = mem_type_for_type (hsa_type);
+  tree model = gimple_call_arg (stmt, 2);
+
+  if (!tree_fits_uhwi_p (model))
+    {
+      HSA_SORRY_ATV
+	(gimple_location (stmt),
+	 "support for HSA does not implement memory model %E", model);
+      return;
+    }
+
+  unsigned HOST_WIDE_INT mmodel = tree_to_uhwi (model);
+
+  BrigMemoryOrder memorder = get_memory_order
+    (mmodel, gimple_location (stmt));
+
+  /* Certain atomic insns must have Bx memory types.  */
+  switch (acode)
+    {
+    case BRIG_ATOMIC_LD:
+    case BRIG_ATOMIC_ST:
+      mtype = hsa_bittype_for_type (mtype);
+      break;
+    default:
+      break;
+    }
+
+  hsa_op_reg *dest;
+  int nops, opcode;
+  if (lhs)
+    {
+      if (ret_orig)
+	dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+      else
+	dest = new hsa_op_reg (hsa_type);
+      opcode = BRIG_OPCODE_ATOMIC;
+      nops = 3;
+    }
+  else
+    {
+      dest = NULL;
+      opcode = BRIG_OPCODE_ATOMICNORET;
+      nops = 2;
+    }
+
+  if (acode == BRIG_ATOMIC_ST && memorder != BRIG_MEMORY_ORDER_RELAXED
+      && memorder != BRIG_MEMORY_ORDER_SC_RELEASE)
+    {
+      HSA_SORRY_ATV (gimple_location (stmt),
+		     "support for HSA does not implement memory model for "
+		     "ATOMIC_ST: %s", get_memory_order_name (mmodel));
+      return;
+    }
+
+  hsa_insn_atomic *atominsn = new hsa_insn_atomic (nops, opcode, acode, mtype,
+						   memorder);
+
+  hsa_op_address *addr;
+  addr = get_address_from_value (gimple_call_arg (stmt, 0), hbb);
+  /* TODO: Warn if addr has private segment, because the finalizer will not
+     accept that (and it does not make much sense).  */
+  hsa_op_base *op = hsa_reg_or_immed_for_gimple_op (gimple_call_arg (stmt, 1),
+						    hbb);
+
+  if (lhs)
+    {
+      atominsn->set_op (0, dest);
+      atominsn->set_op (1, addr);
+      atominsn->set_op (2, op);
+    }
+  else
+    {
+      atominsn->set_op (0, addr);
+      atominsn->set_op (1, op);
+    }
+
+  hbb->append_insn (atominsn);
+
+  /* HSA does not natively support the variants that return the modified
+     value, so redo the operation non-atomically if that is what was
+     requested.  */
+  if (lhs && !ret_orig)
+    {
+      int arith;
+      switch (acode)
+	{
+	case BRIG_ATOMIC_ADD:
+	  arith = BRIG_OPCODE_ADD;
+	  break;
+	case BRIG_ATOMIC_AND:
+	  arith = BRIG_OPCODE_AND;
+	  break;
+	case BRIG_ATOMIC_OR:
+	  arith = BRIG_OPCODE_OR;
+	  break;
+	case BRIG_ATOMIC_SUB:
+	  arith = BRIG_OPCODE_SUB;
+	  break;
+	case BRIG_ATOMIC_XOR:
+	  arith = BRIG_OPCODE_XOR;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      hsa_op_reg *real_dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+      gen_hsa_binary_operation (arith, real_dest, dest, op, hbb);
+    }
+}
+
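+/* Limit, in bytes, on the size of memory operations for which calls to
+   memory builtins such as memcpy are expanded inline.  */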
+#define HSA_MEMORY_BUILTINS_LIMIT     128
+
+/* Generate HSA instructions for the given call statement STMT.  Instructions
+   will be appended to HBB.  */
+
+static void
+gen_hsa_insns_for_call (gimple *stmt, hsa_bb *hbb)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  hsa_op_reg *dest;
+
+  if (!gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
+    {
+      tree function_decl = gimple_call_fndecl (stmt);
+      if (function_decl == NULL_TREE)
+	{
+	  HSA_SORRY_AT (gimple_location (stmt),
+			"support for HSA does not implement indirect calls");
+	  return;
+	}
+
+      if (hsa_callable_function_p (function_decl))
+	gen_hsa_insns_for_direct_call (stmt, hbb);
+      else if (!gen_hsa_insns_for_known_library_call (stmt, hbb))
+	HSA_SORRY_AT (gimple_location (stmt),
+		      "HSA supports only calls to functions within omp "
+		      "declare target");
+      return;
+    }
+
+  tree fndecl = gimple_call_fndecl (stmt);
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    case BUILT_IN_OMP_GET_THREAD_NUM:
+      {
+	query_hsa_grid (stmt, BRIG_OPCODE_WORKITEMABSID, 0, hbb);
+	break;
+      }
+
+    case BUILT_IN_OMP_GET_NUM_THREADS:
+      {
+	query_hsa_grid (stmt, BRIG_OPCODE_GRIDSIZE, 0, hbb);
+	break;
+      }
+
+    case BUILT_IN_FABS:
+    case BUILT_IN_FABSF:
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_ABS, stmt, hbb);
+      break;
+
+    case BUILT_IN_CEIL:
+    case BUILT_IN_CEILF:
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_CEIL, stmt, hbb);
+      break;
+
+    case BUILT_IN_FLOOR:
+    case BUILT_IN_FLOORF:
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_FLOOR, stmt, hbb);
+      break;
+
+    case BUILT_IN_RINT:
+    case BUILT_IN_RINTF:
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_RINT, stmt, hbb);
+      break;
+
+    case BUILT_IN_SQRT:
+    case BUILT_IN_SQRTF:
+      /* TODO: Perhaps produce BRIG_OPCODE_NSQRT with -ffast-math?  */
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_SQRT, stmt, hbb);
+      break;
+
+    case BUILT_IN_TRUNC:
+    case BUILT_IN_TRUNCF:
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_TRUNC, stmt, hbb);
+      break;
+
+    case BUILT_IN_COS:
+    case BUILT_IN_COSF:
+      /* FIXME: Using the native instruction may not be precise enough.
+	 Perhaps only allow if using -ffast-math?  */
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_NCOS, stmt, hbb);
+      break;
+
+    case BUILT_IN_EXP2:
+    case BUILT_IN_EXP2F:
+      /* FIXME: Using the native instruction may not be precise enough.
+	 Perhaps only allow if using -ffast-math?  */
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_NEXP2, stmt, hbb);
+      break;
+
+    case BUILT_IN_LOG2:
+    case BUILT_IN_LOG2F:
+      /* FIXME: Using the native instruction may not be precise enough.
+	 Perhaps only allow if using -ffast-math?  */
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_NLOG2, stmt, hbb);
+      break;
+
+    case BUILT_IN_SIN:
+    case BUILT_IN_SINF:
+      /* FIXME: Using the native instruction may not be precise enough.
+	 Perhaps only allow if using -ffast-math?  */
+      gen_hsa_unaryop_for_builtin (BRIG_OPCODE_NSIN, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_LOAD_1:
+    case BUILT_IN_ATOMIC_LOAD_2:
+    case BUILT_IN_ATOMIC_LOAD_4:
+    case BUILT_IN_ATOMIC_LOAD_8:
+    case BUILT_IN_ATOMIC_LOAD_16:
+      {
+	BrigType16_t mtype;
+	hsa_op_address *addr;
+	addr = get_address_from_value (gimple_call_arg (stmt, 0), hbb);
+	tree model = gimple_call_arg (stmt, 1);
+	if (!tree_fits_uhwi_p (model))
+	  {
+	    HSA_SORRY_ATV
+	      (gimple_location (stmt),
+	       "support for HSA does not implement memory model: %E", model);
+	    return;
+	  }
+
+	unsigned HOST_WIDE_INT mmodel = tree_to_uhwi (model);
+	BrigMemoryOrder memorder = get_memory_order (mmodel,
+						     gimple_location (stmt));
+
+	if (memorder != BRIG_MEMORY_ORDER_RELAXED
+	    && memorder != BRIG_MEMORY_ORDER_SC_ACQUIRE)
+	  {
+	    HSA_SORRY_ATV
+	      (gimple_location (stmt),
+	       "support for HSA does not implement memory model for "
+	       "ATOMIC_LD: %s", get_memory_order_name (mmodel));
+	    return;
+	  }
+
+	if (lhs)
+	  {
+	    mtype = mem_type_for_type
+	      (hsa_type_for_scalar_tree_type (TREE_TYPE (lhs), false));
+	    mtype = hsa_bittype_for_type (mtype);
+	    dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+	  }
+	else
+	  {
+	    mtype = BRIG_TYPE_B64;
+	    dest = new hsa_op_reg (mtype);
+	  }
+
+	hsa_insn_atomic *atominsn
+	  = new hsa_insn_atomic (2, BRIG_OPCODE_ATOMIC, BRIG_ATOMIC_LD, mtype,
+				 memorder, dest, addr);
+
+	hbb->append_insn (atominsn);
+	break;
+      }
+
+    case BUILT_IN_ATOMIC_EXCHANGE_1:
+    case BUILT_IN_ATOMIC_EXCHANGE_2:
+    case BUILT_IN_ATOMIC_EXCHANGE_4:
+    case BUILT_IN_ATOMIC_EXCHANGE_8:
+    case BUILT_IN_ATOMIC_EXCHANGE_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_EXCH, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_ADD_1:
+    case BUILT_IN_ATOMIC_FETCH_ADD_2:
+    case BUILT_IN_ATOMIC_FETCH_ADD_4:
+    case BUILT_IN_ATOMIC_FETCH_ADD_8:
+    case BUILT_IN_ATOMIC_FETCH_ADD_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_ADD, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_SUB_1:
+    case BUILT_IN_ATOMIC_FETCH_SUB_2:
+    case BUILT_IN_ATOMIC_FETCH_SUB_4:
+    case BUILT_IN_ATOMIC_FETCH_SUB_8:
+    case BUILT_IN_ATOMIC_FETCH_SUB_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_SUB, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_AND_1:
+    case BUILT_IN_ATOMIC_FETCH_AND_2:
+    case BUILT_IN_ATOMIC_FETCH_AND_4:
+    case BUILT_IN_ATOMIC_FETCH_AND_8:
+    case BUILT_IN_ATOMIC_FETCH_AND_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_AND, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_XOR_1:
+    case BUILT_IN_ATOMIC_FETCH_XOR_2:
+    case BUILT_IN_ATOMIC_FETCH_XOR_4:
+    case BUILT_IN_ATOMIC_FETCH_XOR_8:
+    case BUILT_IN_ATOMIC_FETCH_XOR_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_XOR, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_OR_1:
+    case BUILT_IN_ATOMIC_FETCH_OR_2:
+    case BUILT_IN_ATOMIC_FETCH_OR_4:
+    case BUILT_IN_ATOMIC_FETCH_OR_8:
+    case BUILT_IN_ATOMIC_FETCH_OR_16:
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_OR, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_STORE_1:
+    case BUILT_IN_ATOMIC_STORE_2:
+    case BUILT_IN_ATOMIC_STORE_4:
+    case BUILT_IN_ATOMIC_STORE_8:
+    case BUILT_IN_ATOMIC_STORE_16:
+      /* Since there cannot be any LHS, the first parameter is meaningless.  */
+      gen_hsa_ternary_atomic_for_builtin (true, BRIG_ATOMIC_ST, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_ADD_FETCH_1:
+    case BUILT_IN_ATOMIC_ADD_FETCH_2:
+    case BUILT_IN_ATOMIC_ADD_FETCH_4:
+    case BUILT_IN_ATOMIC_ADD_FETCH_8:
+    case BUILT_IN_ATOMIC_ADD_FETCH_16:
+      gen_hsa_ternary_atomic_for_builtin (false, BRIG_ATOMIC_ADD, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_SUB_FETCH_1:
+    case BUILT_IN_ATOMIC_SUB_FETCH_2:
+    case BUILT_IN_ATOMIC_SUB_FETCH_4:
+    case BUILT_IN_ATOMIC_SUB_FETCH_8:
+    case BUILT_IN_ATOMIC_SUB_FETCH_16:
+      gen_hsa_ternary_atomic_for_builtin (false, BRIG_ATOMIC_SUB, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_AND_FETCH_1:
+    case BUILT_IN_ATOMIC_AND_FETCH_2:
+    case BUILT_IN_ATOMIC_AND_FETCH_4:
+    case BUILT_IN_ATOMIC_AND_FETCH_8:
+    case BUILT_IN_ATOMIC_AND_FETCH_16:
+      gen_hsa_ternary_atomic_for_builtin (false, BRIG_ATOMIC_AND, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_XOR_FETCH_1:
+    case BUILT_IN_ATOMIC_XOR_FETCH_2:
+    case BUILT_IN_ATOMIC_XOR_FETCH_4:
+    case BUILT_IN_ATOMIC_XOR_FETCH_8:
+    case BUILT_IN_ATOMIC_XOR_FETCH_16:
+      gen_hsa_ternary_atomic_for_builtin (false, BRIG_ATOMIC_XOR, stmt, hbb);
+      break;
+
+    case BUILT_IN_ATOMIC_OR_FETCH_1:
+    case BUILT_IN_ATOMIC_OR_FETCH_2:
+    case BUILT_IN_ATOMIC_OR_FETCH_4:
+    case BUILT_IN_ATOMIC_OR_FETCH_8:
+    case BUILT_IN_ATOMIC_OR_FETCH_16:
+      gen_hsa_ternary_atomic_for_builtin (false, BRIG_ATOMIC_OR, stmt, hbb);
+      break;
+
+    case BUILT_IN_SYNC_VAL_COMPARE_AND_SWAP_1:
+    case BUILT_IN_SYNC_VAL_COMPARE_AND_SWAP_2:
+    case BUILT_IN_SYNC_VAL_COMPARE_AND_SWAP_4:
+    case BUILT_IN_SYNC_VAL_COMPARE_AND_SWAP_8:
+    case BUILT_IN_SYNC_VAL_COMPARE_AND_SWAP_16:
+      {
+	/* XXX Ignore mem model for now.  */
+	tree type = TREE_TYPE (gimple_call_arg (stmt, 1));
+
+	BrigType16_t atype  = hsa_bittype_for_type
+	  (hsa_type_for_scalar_tree_type (type, false));
+
+	hsa_insn_atomic *atominsn = new hsa_insn_atomic
+	  (4, BRIG_OPCODE_ATOMIC, BRIG_ATOMIC_CAS, atype,
+	   BRIG_MEMORY_ORDER_SC_ACQUIRE_RELEASE);
+	hsa_op_address *addr;
+	addr = get_address_from_value (gimple_call_arg (stmt, 0), hbb);
+
+	if (lhs != NULL)
+	  dest = hsa_cfun->reg_for_gimple_ssa (lhs);
+	else
+	  dest = new hsa_op_reg (atype);
+
+	/* Should check what the memory scope is.  */
+	atominsn->m_memoryscope = BRIG_MEMORY_SCOPE_WORKGROUP;
+	atominsn->set_op (0, dest);
+	atominsn->set_op (1, addr);
+	atominsn->set_op
+	  (2, hsa_reg_or_immed_for_gimple_op (gimple_call_arg (stmt, 1), hbb));
+	atominsn->set_op
+	  (3, hsa_reg_or_immed_for_gimple_op (gimple_call_arg (stmt, 2), hbb));
+
+	hbb->append_insn (atominsn);
+	break;
+      }
+    case BUILT_IN_GOMP_PARALLEL:
+      {
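+	/* The first argument of GOMP_parallel is the address of the
+	   outlined function, which must be registered as a kernel the
+	   current kernel depends on.  */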
+	gcc_checking_assert (gimple_call_num_args (stmt) == 4);
+	tree called = gimple_call_arg (stmt, 0);
+	gcc_checking_assert (TREE_CODE (called) == ADDR_EXPR);
+	called = TREE_OPERAND (called, 0);
+	gcc_checking_assert (TREE_CODE (called) == FUNCTION_DECL);
+
+	hsa_add_kernel_dependency
+	  (hsa_cfun->m_decl,
+	   hsa_brig_function_name (hsa_get_declaration_name (called)));
+	gen_hsa_insns_for_kernel_call (hbb, as_a <gcall *> (stmt));
+
+	break;
+      }
+    case BUILT_IN_GOMP_TEAMS:
+      {
+	gen_set_num_threads (gimple_call_arg (stmt, 1), hbb);
+	break;
+      }
+    case BUILT_IN_OMP_GET_NUM_TEAMS:
+      {
+	gen_get_num_teams (stmt, hbb);
+	break;
+      }
+    case BUILT_IN_OMP_GET_TEAM_NUM:
+      {
+	gen_get_team_num (stmt, hbb);
+	break;
+      }
+    case BUILT_IN_MEMCPY:
+      {
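+	/* Expand memcpy inline only when the number of bytes is a small
+	   constant; otherwise emit a real call.  */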
+	tree byte_size = gimple_call_arg (stmt, 2);
+
+	if (TREE_CODE (byte_size) != INTEGER_CST)
+	  {
+	    gen_hsa_insns_for_direct_call (stmt, hbb);
+	    return;
+	  }
+
+	unsigned n = tree_to_uhwi (byte_size);
+
+	if (n > HSA_MEMORY_BUILTINS_LIMIT)
+	  {
+	    gen_hsa_insns_for_direct_call (stmt, hbb);
+	    return;
+	  }
+
+	tree dst = gimple_call_arg (stmt, 0);
+	tree src = gimple_call_arg (stmt, 1);
+
+	hsa_op_address *dst_addr = get_address_from_value (dst, hbb);
+	hsa_op_address *src_addr = get_address_from_value (src, hbb);
+
+	gen_hsa_memory_copy (hbb, dst_addr, src_addr, n);
+
+	tree lhs = gimple_call_lhs (stmt);
+	if (lhs)
+	  gen_hsa_insns_for_single_assignment (lhs, dst, hbb);
+
+	break;
+      }
+    case BUILT_IN_MEMSET:
+      {
+	tree dst = gimple_call_arg (stmt, 0);
+	tree c = gimple_call_arg (stmt, 1);
+
+	if (TREE_CODE (c) != INTEGER_CST)
+	  {
+	    gen_hsa_insns_for_direct_call (stmt, hbb);
+	    return;
+	  }
+
+	tree byte_size = gimple_call_arg (stmt, 2);
+
+	if (TREE_CODE (byte_size) != INTEGER_CST)
+	  {
+	    gen_hsa_insns_for_direct_call (stmt, hbb);
+	    return;
+	  }
+
+	unsigned n = tree_to_uhwi (byte_size);
+
+	if (n > HSA_MEMORY_BUILTINS_LIMIT)
+	  {
+	    gen_hsa_insns_for_direct_call (stmt, hbb);
+	    return;
+	  }
+
+	hsa_op_address *dst_addr;
+	dst_addr = get_address_from_value (dst, hbb);
+	unsigned HOST_WIDE_INT constant = tree_to_uhwi
+	  (fold_convert (unsigned_char_type_node, c));
+
+	gen_hsa_memory_set (hbb, dst_addr, constant, n);
+
+	tree lhs = gimple_call_lhs (stmt);
+	if (lhs)
+	  gen_hsa_insns_for_single_assignment (lhs, dst, hbb);
+
+	break;
+      }
+    default:
+      {
+	gen_hsa_insns_for_direct_call (stmt, hbb);
+	return;
+      }
+    }
+}
+
+/* Generate HSA instructions for a given gimple statement.  Instructions will be
+   appended to HBB.  */
+
+static void
+gen_hsa_insns_for_gimple_stmt (gimple *stmt, hsa_bb *hbb)
+{
+  switch (gimple_code (stmt))
+    {
+    case GIMPLE_ASSIGN:
+      if (gimple_clobber_p (stmt))
+	break;
+
+      if (gimple_assign_single_p (stmt))
+	{
+	  tree lhs = gimple_assign_lhs (stmt);
+	  tree rhs = gimple_assign_rhs1 (stmt);
+	  gen_hsa_insns_for_single_assignment (lhs, rhs, hbb);
+	}
+      else
+	gen_hsa_insns_for_operation_assignment (stmt, hbb);
+      break;
+    case GIMPLE_RETURN:
+      gen_hsa_insns_for_return (as_a <greturn *> (stmt), hbb);
+      break;
+    case GIMPLE_COND:
+      gen_hsa_insns_for_cond_stmt (stmt, hbb);
+      break;
+    case GIMPLE_CALL:
+      gen_hsa_insns_for_call (stmt, hbb);
+      break;
+    case GIMPLE_DEBUG:
+      /* ??? HSA supports some debug facilities.  */
+      break;
+    case GIMPLE_LABEL:
+    {
+      tree label = gimple_label_label (as_a <glabel *> (stmt));
+      if (FORCED_LABEL (label))
+	HSA_SORRY_AT (gimple_location (stmt),
+		      "support for HSA does not implement gimple label with "
+		      "address taken");
+
+      break;
+    }
+    case GIMPLE_NOP:
+    {
+      hbb->append_insn (new hsa_insn_basic (0, BRIG_OPCODE_NOP));
+      break;
+    }
+    case GIMPLE_SWITCH:
+    {
+      gen_hsa_insns_for_switch_stmt (as_a <gswitch *> (stmt), hbb);
+      break;
+    }
+    default:
+      HSA_SORRY_ATV (gimple_location (stmt),
+		     "support for HSA does not implement gimple statement %s",
+		     gimple_code_name[(int) gimple_code (stmt)]);
+    }
+}
+
+/* Generate a HSA PHI from a gimple PHI.  */
+
+static void
+gen_hsa_phi_from_gimple_phi (gimple *phi_stmt, hsa_bb *hbb)
+{
+  hsa_insn_phi *hphi;
+  unsigned count = gimple_phi_num_args (phi_stmt);
+
+  hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa
+    (gimple_phi_result (phi_stmt));
+  hphi = new hsa_insn_phi (count, dest);
+  hphi->m_bb = hbb->m_bb;
+
+  tree lhs = gimple_phi_result (phi_stmt);
+
+  for (unsigned i = 0; i < count; i++)
+    {
+      tree op = gimple_phi_arg_def (phi_stmt, i);
+
+      if (TREE_CODE (op) == SSA_NAME)
+	{
+	  hsa_op_reg *hreg = hsa_cfun->reg_for_gimple_ssa (op);
+	  hphi->set_op (i, hreg);
+	}
+      else
+	{
+	  gcc_assert (is_gimple_min_invariant (op));
+	  tree t = TREE_TYPE (op);
+	  if (!POINTER_TYPE_P (t)
+	      || (TREE_CODE (op) == STRING_CST
+		  && TREE_CODE (TREE_TYPE (t)) == INTEGER_TYPE))
+	    hphi->set_op (i, new hsa_op_immed (op));
+	  else if (POINTER_TYPE_P (TREE_TYPE (lhs))
+		   && TREE_CODE (op) == INTEGER_CST)
+	    {
+	      /* Handle assignment of NULL value to a pointer type.  */
+	      hphi->set_op (i, new hsa_op_immed (op));
+	    }
+	  else if (TREE_CODE (op) == ADDR_EXPR)
+	    {
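+	      /* A PHI node cannot contain instructions, so split the
+		 incoming edge and emit the address load in the new
+		 predecessor block.  */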
+	      edge e = gimple_phi_arg_edge (as_a <gphi *> (phi_stmt), i);
+	      hsa_bb *hbb_src = hsa_init_new_bb (split_edge (e));
+	      hsa_op_address *addr = gen_hsa_addr (TREE_OPERAND (op, 0),
+						   hbb_src);
+
+	      hsa_op_reg *dest = new hsa_op_reg (BRIG_TYPE_U64);
+	      hsa_insn_basic *insn = new  hsa_insn_basic
+		(2, BRIG_OPCODE_LDA, BRIG_TYPE_U64, dest, addr);
+	      hbb_src->append_insn (insn);
+
+	      hphi->set_op (i, dest);
+	    }
+	  else
+	    {
+	      HSA_SORRY_AT (gimple_location (phi_stmt),
+			    "support for HSA does not handle PHI nodes with "
+			    "constant address operands");
+	      return;
+	    }
+	}
+    }
+
+  hphi->m_prev = hbb->m_last_phi;
+  hphi->m_next = NULL;
+  if (hbb->m_last_phi)
+    hbb->m_last_phi->m_next = hphi;
+  hbb->m_last_phi = hphi;
+  if (!hbb->m_first_phi)
+    hbb->m_first_phi = hphi;
+}
+
+/* Constructor of class containing HSA-specific information about a basic
+   block.  CFG_BB is the CFG BB this HSA BB is associated with.  IDX is the new
+   index of this BB (so that the constructor does not attempt to use
+   hsa_cfun during its construction).  */
+
+hsa_bb::hsa_bb (basic_block cfg_bb, int idx): m_bb (cfg_bb),
+  m_first_insn (NULL), m_last_insn (NULL), m_first_phi (NULL),
+  m_last_phi (NULL), m_index (idx), m_liveout (BITMAP_ALLOC (NULL)),
+  m_livein (BITMAP_ALLOC (NULL))
+{
+  gcc_assert (!cfg_bb->aux);
+  cfg_bb->aux = this;
+}
+
+/* Constructor of class containing HSA-specific information about a basic
+   block.  CFG_BB is the CFG BB this HSA BB is associated with.  */
+
+hsa_bb::hsa_bb (basic_block cfg_bb): m_bb (cfg_bb),
+  m_first_insn (NULL), m_last_insn (NULL), m_first_phi (NULL),
+  m_last_phi (NULL), m_index (hsa_cfun->m_hbb_count++),
+  m_liveout (BITMAP_ALLOC (NULL)), m_livein (BITMAP_ALLOC (NULL))
+{
+  gcc_assert (!cfg_bb->aux);
+  cfg_bb->aux = this;
+}
+
+/* Destructor of class representing HSA BB.  */
+
+hsa_bb::~hsa_bb ()
+{
+  BITMAP_FREE (m_livein);
+  BITMAP_FREE (m_liveout);
+}
+
+/* Create, initialize and return a new hsa_bb structure for a given CFG
+   basic block BB.  */
+
+hsa_bb *
+hsa_init_new_bb (basic_block bb)
+{
+  return new (hsa_allocp_bb) hsa_bb (bb);
+}
+
+/* Initialize the prologue of the current HSA kernel, so far only by
+   emitting a store of a debug value if debug stores are enabled.  */
+
+static void
+init_prologue (void)
+{
+  if (!hsa_cfun->m_kern_p)
+    return;
+
+  hsa_bb *prologue = hsa_bb_for_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  /* Create a magic number that is going to be printed by libgomp.  */
+  unsigned index = hsa_get_number_decl_kernel_mappings ();
+
+  /* Emit store to debug argument.  */
+  if (PARAM_VALUE (PARAM_HSA_GEN_DEBUG_STORES) > 0)
+    set_debug_value (prologue, new hsa_op_immed (1000 + index, BRIG_TYPE_U64));
+}
+
+/* Initialize hsa_num_threads to a default value.  */
+
+static void
+init_hsa_num_threads (void)
+{
+  hsa_bb *prologue = hsa_bb_for_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  /* Save the default value to private variable hsa_num_threads.  */
+  hsa_insn_basic *basic = new hsa_insn_mem
+    (BRIG_OPCODE_ST, hsa_num_threads->m_type,
+     new hsa_op_immed (0, hsa_num_threads->m_type),
+     new hsa_op_address (hsa_num_threads));
+  prologue->append_insn (basic);
+}
+
+/* Go over gimple representation and generate our internal HSA one.  */
+
+static void
+gen_body_from_gimple ()
+{
+  basic_block bb;
+
+  /* Verify CFG for complex edges we are unable to handle.  */
+  edge_iterator ei;
+  edge e;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* Check all edges for flags we cannot handle, so far only
+	     exception handling edges.  */
+	  if (e->flags & EDGE_EH)
+	    {
+	      HSA_SORRY_AT
+		(UNKNOWN_LOCATION,
+		 "support for HSA does not implement exception handling");
+	      return;
+	    }
+	}
+    }
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      gimple_stmt_iterator gsi;
+      hsa_bb *hbb = hsa_bb_for_bb (bb);
+      if (hbb)
+	continue;
+
+      hbb = hsa_init_new_bb (bb);
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gen_hsa_insns_for_gimple_stmt (gsi_stmt (gsi), hbb);
+	  if (hsa_seen_error ())
+	    return;
+	}
+    }
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      gimple_stmt_iterator gsi;
+      hsa_bb *hbb = hsa_bb_for_bb (bb);
+      gcc_assert (hbb != NULL);
+
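+      /* Virtual operands only describe memory effects and have no HSA
+	 counterpart, so skip PHI nodes with virtual results.  */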
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	if (!virtual_operand_p (gimple_phi_result (gsi_stmt (gsi))))
+	  gen_hsa_phi_from_gimple_phi (gsi_stmt (gsi), hbb);
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "------- Generated SSA form -------\n");
+      dump_hsa_cfun (dump_file);
+    }
+}
+
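+/* Generate descriptions of formal parameters of function declaration DECL
+   and fill them into its HSA representation F, including an output
+   argument if the result type is not void.  */
+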
+static void
+gen_function_decl_parameters (hsa_function_representation *f,
+			      tree decl)
+{
+  tree parm;
+  unsigned i;
+
+  for (parm = TYPE_ARG_TYPES (TREE_TYPE (decl)), i = 0;
+       parm;
+       parm = TREE_CHAIN (parm), i++)
+    {
+      /* The result type is last in the tree list.  */
+      if (TREE_CHAIN (parm) == NULL)
+	break;
+
+      tree v = TREE_VALUE (parm);
+
+      hsa_symbol *arg = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_ARG,
+					BRIG_LINKAGE_NONE);
+      arg->m_type = hsa_type_for_tree_type (v, &arg->m_dim);
+      arg->m_name_number = i;
+
+      f->m_input_args.safe_push (arg);
+    }
+
+  tree result_type = TREE_TYPE (TREE_TYPE (decl));
+  if (!VOID_TYPE_P (result_type))
+    {
+      f->m_output_arg = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_ARG,
+					BRIG_LINKAGE_NONE);
+      f->m_output_arg->m_type = hsa_type_for_tree_type
+	(result_type, &f->m_output_arg->m_dim);
+      f->m_output_arg->m_name = "res";
+    }
+}
+
+/* Generate the vector of parameters of the HSA representation of the current
+   function.  This also includes the output parameter representing the
+   result.  */
+
+static void
+gen_function_def_parameters ()
+{
+  tree parm;
+
+  hsa_bb *prologue = hsa_bb_for_bb (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  for (parm = DECL_ARGUMENTS (cfun->decl); parm;
+       parm = DECL_CHAIN (parm))
+    {
+      struct hsa_symbol **slot;
+
+      hsa_symbol *arg = new hsa_symbol
+	(BRIG_TYPE_NONE,
+	 hsa_cfun->m_kern_p ? BRIG_SEGMENT_KERNARG : BRIG_SEGMENT_ARG,
+	 BRIG_LINKAGE_FUNCTION);
+      arg->fillup_for_decl (parm);
+
+      hsa_cfun->m_input_args.safe_push (arg);
+
+      if (hsa_seen_error ())
+	return;
+
+      arg->m_name = hsa_get_declaration_name (parm);
+
+      /* Copy all input arguments and create corresponding private symbols
+	 for them.  */
+      hsa_symbol *private_arg;
+      hsa_op_address *parm_addr = new hsa_op_address (arg);
+
+      if (TREE_ADDRESSABLE (parm)
+	  || (!is_gimple_reg (parm) && !TREE_READONLY (parm)))
+	{
+	  private_arg = hsa_cfun->create_hsa_temporary (arg->m_type);
+	  private_arg->fillup_for_decl (parm);
+
+	  hsa_op_address *private_arg_addr = new hsa_op_address (private_arg);
+	  gen_hsa_memory_copy (prologue, private_arg_addr, parm_addr,
+			       arg->total_byte_size ());
+	}
+      else
+	private_arg = arg;
+
+      slot = hsa_cfun->m_local_symbols->find_slot (private_arg, INSERT);
+      gcc_assert (!*slot);
+      *slot = private_arg;
+
+      if (is_gimple_reg (parm))
+	{
+	  tree ddef = ssa_default_def (cfun, parm);
+	  if (ddef && !has_zero_uses (ddef))
+	    {
+	      BrigType16_t mtype = mem_type_for_type
+		(hsa_type_for_scalar_tree_type (TREE_TYPE (ddef), false));
+	      hsa_op_reg *dest = hsa_cfun->reg_for_gimple_ssa (ddef);
+	      hsa_insn_mem *mem = new hsa_insn_mem (BRIG_OPCODE_LD, mtype,
+						    dest, parm_addr);
+	      gcc_assert (!parm_addr->m_reg);
+	      prologue->append_insn (mem);
+	    }
+	}
+    }
+
+  if (!VOID_TYPE_P (TREE_TYPE (TREE_TYPE (cfun->decl))))
+    {
+      struct hsa_symbol **slot;
+
+      hsa_cfun->m_output_arg = new hsa_symbol (BRIG_TYPE_NONE, BRIG_SEGMENT_ARG,
+					       BRIG_LINKAGE_FUNCTION);
+      hsa_cfun->m_output_arg->fillup_for_decl (DECL_RESULT (cfun->decl));
+
+      if (hsa_seen_error ())
+	return;
+
+      hsa_cfun->m_output_arg->m_name = "res";
+      slot = hsa_cfun->m_local_symbols->find_slot (hsa_cfun->m_output_arg,
+						   INSERT);
+      gcc_assert (!*slot);
+      *slot = hsa_cfun->m_output_arg;
+    }
+}
+
+/* Generate function representation that corresponds to
+   a function declaration.  */
+
+hsa_function_representation *
+hsa_generate_function_declaration (tree decl)
+{
+  hsa_function_representation *fun = new hsa_function_representation
+    (decl, false, 0);
+
+  fun->m_declaration_p = true;
+  fun->m_name = get_brig_function_name (decl);
+  gen_function_decl_parameters (fun, decl);
+
+  return fun;
+}
+
+/* Return true if switch statement S can be transformed
+   to an SBR instruction in HSAIL.  */
+
+static bool
+transformable_switch_to_sbr_p (gswitch *s)
+{
+  /* Identify if a switch statement can be transformed to
+     SBR instruction, like:
+
+     sbr_u32 $s1 [@label1, @label2, @label3];
+  */
+
+  tree size = get_switch_size (s);
+  if (!tree_fits_uhwi_p (size))
+    return false;
+
+  if (tree_to_uhwi (size) > HSA_MAXIMUM_SBR_LABELS)
+    return false;
+
+  return true;
+}
+
+/* Structure holding the connection between PHI nodes and the immediate
+   values held by those nodes.  */
+
+struct phi_definition
+{
+  phi_definition (unsigned phi_i, unsigned label_i, tree imm):
+    phi_index (phi_i), label_index (label_i), phi_value (imm)
+  {}
+
+  unsigned phi_index;
+  unsigned label_index;
+  tree phi_value;
+};
+
+/* Sum slice of a vector V, starting from index START and ending
+   at the index END - 1.  */
+
+template <typename T>
+static
+T sum_slice (const auto_vec <T> &v, unsigned start, unsigned end)
+{
+  T s = 0;
+
+  for (unsigned i = start; i < end; i++)
+    s += v[i];
+
+  return s;
+}
+
+/* Function transforms GIMPLE SWITCH statements to a series of IF statements.
+   Let's assume the following example:
+
+L0:
+   switch (index)
+     case C1:
+L1:    hard_work_1 ();
+       break;
+     case C2..C3:
+L2:    hard_work_2 ();
+       break;
+     default:
+LD:    hard_work_3 ();
+       break;
+
+  The transformation encompasses the following steps:
+    1) all immediate values used by edges coming from the switch basic block
+       are saved
+    2) all these edges are removed
+    3) the switch statement (in L0) is replaced by:
+	 if (index == C1)
+	   goto L1;
+	 else
+	   goto L1';
+
+    4) a newly created basic block Lx' is used for generation of the next
+       condition
+    5) the else branch of the last condition goes to LD
+    6) fix all immediate values in PHI nodes that were propagated through
+       edges that were removed in step 2
+
+  Note: if a case covers a range C1..C2, then the following
+	transformation is used:
+
+  switch_cond_op1 = C1 <= index;
+  switch_cond_op2 = index <= C2;
+  switch_cond_and = switch_cond_op1 & switch_cond_op2;
+  if (switch_cond_and != 0)
+    goto Lx;
+  else
+    goto Ly;
+
+*/
+
+static void
+convert_switch_statements ()
+{
+  function *func = DECL_STRUCT_FUNCTION (current_function_decl);
+  basic_block bb;
+
+  bool need_update = false;
+
+  FOR_EACH_BB_FN (bb, func)
+  {
+    gimple_stmt_iterator gsi = gsi_last_bb (bb);
+    if (gsi_end_p (gsi))
+      continue;
+
+    gimple *stmt = gsi_stmt (gsi);
+
+    if (gimple_code (stmt) == GIMPLE_SWITCH)
+      {
+	gswitch *s = as_a <gswitch *> (stmt);
+
+	/* If the switch can use an SBR instruction, skip the statement.  */
+	if (transformable_switch_to_sbr_p (s))
+	  continue;
+
+	need_update = true;
+
+	unsigned labels = gimple_switch_num_labels (s);
+	tree index = gimple_switch_index (s);
+	tree index_type = TREE_TYPE (index);
+	tree default_label = gimple_switch_default_label (s);
+	basic_block default_label_bb = label_to_block_fn
+	  (func, CASE_LABEL (default_label));
+	basic_block cur_bb = bb;
+
+	auto_vec <edge> new_edges;
+	auto_vec <phi_definition *> phi_todo_list;
+	auto_vec <gcov_type> edge_counts;
+	auto_vec <int> edge_probabilities;
+
+	/* Investigate all labels and the PHI nodes on their edges, which
+	   will need to be fixed after we add the new collection of edges.  */
+	for (unsigned i = 0; i < labels; i++)
+	  {
+	    tree label = gimple_switch_label (s, i);
+	    basic_block label_bb = label_to_block_fn (func, CASE_LABEL (label));
+	    edge e = find_edge (bb, label_bb);
+	    edge_counts.safe_push (e->count);
+	    edge_probabilities.safe_push (e->probability);
+	    gphi_iterator phi_gsi;
+
+	    /* Save PHI definitions that will be destroyed because an edge
+	       is going to be removed.  */
+	    unsigned phi_index = 0;
+	    for (phi_gsi = gsi_start_phis (e->dest);
+		 !gsi_end_p (phi_gsi); gsi_next (&phi_gsi))
+	      {
+		gphi *phi = phi_gsi.phi ();
+		for (unsigned j = 0; j < gimple_phi_num_args (phi); j++)
+		  {
+		    if (gimple_phi_arg_edge (phi, j) == e)
+		      {
+			tree imm = gimple_phi_arg_def (phi, j);
+			phi_todo_list.safe_push
+			  (new phi_definition (phi_index, i, imm));
+			break;
+		      }
+		  }
+		phi_index++;
+	      }
+	  }
+
+	/* Remove all edges for the current basic block.  */
+	for (int i = EDGE_COUNT (bb->succs) - 1; i >= 0; i--)
+ 	  {
+	    edge e = EDGE_SUCC (bb, i);
+	    remove_edge (e);
+	  }
+
+	/* Iterate all non-default labels.  */
+	for (unsigned i = 1; i < labels; i++)
+	  {
+	    tree label = gimple_switch_label (s, i);
+	    tree low = CASE_LOW (label);
+	    tree high = CASE_HIGH (label);
+
+	    if (!useless_type_conversion_p (TREE_TYPE (low), index_type))
+	      low = fold_convert (index_type, low);
+
+	    gimple_stmt_iterator cond_gsi = gsi_last_bb (cur_bb);
+	    gimple *c = NULL;
+	    if (high)
+	      {
+		tree tmp1 = make_temp_ssa_name (boolean_type_node, NULL,
+						"switch_cond_op1");
+
+		gimple *assign1 = gimple_build_assign (tmp1, LE_EXPR, low,
+						      index);
+
+		tree tmp2 = make_temp_ssa_name (boolean_type_node, NULL,
+						"switch_cond_op2");
+
+		if (!useless_type_conversion_p (TREE_TYPE (high), index_type))
+		  high = fold_convert (index_type, high);
+		gimple *assign2 = gimple_build_assign (tmp2, LE_EXPR, index,
+						      high);
+
+		tree tmp3 = make_temp_ssa_name (boolean_type_node, NULL,
+						"switch_cond_and");
+		gimple *assign3 = gimple_build_assign (tmp3, BIT_AND_EXPR, tmp1,
+						      tmp2);
+
+		gsi_insert_before (&cond_gsi, assign1, GSI_SAME_STMT);
+		gsi_insert_before (&cond_gsi, assign2, GSI_SAME_STMT);
+		gsi_insert_before (&cond_gsi, assign3, GSI_SAME_STMT);
+
+		c = gimple_build_cond (NE_EXPR, tmp3, constant_boolean_node
+				       (false, boolean_type_node), NULL, NULL);
+	      }
+	    else
+	      c = gimple_build_cond (EQ_EXPR, index, low, NULL, NULL);
+
+	    gimple_set_location (c, gimple_location (stmt));
+
+	    gsi_insert_before (&cond_gsi, c, GSI_SAME_STMT);
+
+	    basic_block label_bb = label_to_block_fn
+	      (func, CASE_LABEL (label));
+	    edge new_edge = make_edge (cur_bb, label_bb, EDGE_TRUE_VALUE);
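+	    /* Scale the probability of this label against the sum of the
+	       probabilities of all labels that have not been resolved yet,
+	       including this one and the default label.  */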
+	    int prob_sum = sum_slice <int> (edge_probabilities, i, labels)
+	      + edge_probabilities[0];
+
+	    if (prob_sum)
+	      new_edge->probability = RDIV
+		(REG_BR_PROB_BASE * edge_probabilities[i], prob_sum);
+
+	    new_edge->count = edge_counts[i];
+	    new_edges.safe_push (new_edge);
+
+	    if (i < labels - 1)
+	      {
+		/* Prepare another basic block that will contain
+		   next condition.  */
+		basic_block next_bb = create_empty_bb (cur_bb);
+		if (current_loops)
+		  {
+		    add_bb_to_loop (next_bb, cur_bb->loop_father);
+		    loops_state_set (LOOPS_NEED_FIXUP);
+		  }
+
+		edge next_edge = make_edge (cur_bb, next_bb, EDGE_FALSE_VALUE);
+		next_edge->probability = inverse_probability
+		  (new_edge->probability);
+		next_edge->count = edge_counts[0]
+		  + sum_slice <gcov_type> (edge_counts, i, labels);
+		next_bb->frequency = EDGE_FREQUENCY (next_edge);
+		cur_bb = next_bb;
+	      }
+	    else /* Link last IF statement and default label
+		    of the switch.  */
+	      {
+		edge e = make_edge (cur_bb, default_label_bb, EDGE_FALSE_VALUE);
+		e->probability = inverse_probability (new_edge->probability);
+		e->count = edge_counts[0];
+		new_edges.safe_insert (0, e);
+	      }
+	  }
+
+	  /* Restore original PHI immediate values.  */
+	  for (unsigned i = 0; i < phi_todo_list.length (); i++)
+	    {
+	      phi_definition *phi_def = phi_todo_list[i];
+	      edge new_edge = new_edges[phi_def->label_index];
+
+	      gphi_iterator it = gsi_start_phis (new_edge->dest);
+	      for (unsigned i = 0; i < phi_def->phi_index; i++)
+		gsi_next (&it);
+
+	      gphi *phi = it.phi ();
+	      add_phi_arg (phi, phi_def->phi_value, new_edge, UNKNOWN_LOCATION);
+	      delete phi_def;
+	    }
+
+	/* Remove the original GIMPLE switch statement.  */
+	gsi_remove (&gsi, true);
+      }
+  }
+
+  if (dump_file)
+    dump_function_to_file (current_function_decl, dump_file, TDF_DETAILS);
+
+  if (need_update)
+    {
+      free_dominance_info (CDI_DOMINATORS);
+      calculate_dominance_info (CDI_DOMINATORS);
+    }
+}
+
+/* Emit HSA module variables that are global for the entire module.  */
+
+static void
+emit_hsa_module_variables (void)
+{
+  hsa_num_threads = new hsa_symbol (BRIG_TYPE_U32, BRIG_SEGMENT_PRIVATE,
+				    BRIG_LINKAGE_MODULE);
+
+  hsa_num_threads->m_name = "hsa_num_threads";
+  hsa_num_threads->m_global_scope_p = true;
+
+  hsa_brig_emit_omp_symbols ();
+}
+
+/* Generate HSAIL representation of the current function and write into a
+   special section of the output file.  If KERNEL is set, the function will be
+   considered an HSA kernel callable from the host, otherwise it will be
+   compiled as an HSA function callable from other HSA code.  */
+
+static void
+generate_hsa (bool kernel)
+{
+  hsa_init_data_for_cfun ();
+
+  if (hsa_num_threads == NULL)
+    emit_hsa_module_variables ();
+
+  /* Initialize hsa_cfun.  */
+  hsa_cfun = new hsa_function_representation (cfun->decl, kernel,
+					      SSANAMES (cfun)->length ());
+  hsa_cfun->init_extra_bbs ();
+
+  if (flag_tm)
+    {
+      HSA_SORRY_AT (UNKNOWN_LOCATION,
+		    "support for HSA does not implement transactional memory");
+      goto fail;
+    }
+
+  verify_function_arguments (cfun->decl);
+  if (hsa_seen_error ())
+    goto fail;
+
+  hsa_cfun->m_name = get_brig_function_name (cfun->decl);
+
+  gen_function_def_parameters ();
+  if (hsa_seen_error ())
+    goto fail;
+
+  init_prologue ();
+
+  gen_body_from_gimple ();
+  if (hsa_seen_error ())
+    goto fail;
+
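+  /* Only functions that dispatch other kernels need the hsa_num_threads
+     variable set up.  */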
+  if (hsa_cfun->m_kernel_dispatch_count)
+    init_hsa_num_threads ();
+
+  if (hsa_cfun->m_kern_p)
+    {
+      hsa_add_kern_decl_mapping (current_function_decl, hsa_cfun->m_name,
+				 hsa_cfun->m_maximum_omp_data_size);
+    }
+
+#ifdef ENABLE_CHECKING
+  for (unsigned i = 0; i < hsa_cfun->m_ssa_map.length (); i++)
+    if (hsa_cfun->m_ssa_map[i])
+      hsa_cfun->m_ssa_map[i]->verify_ssa ();
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      hsa_bb *hbb = hsa_bb_for_bb (bb);
+
+      for (hsa_insn_basic *insn = hbb->m_first_insn; insn; insn = insn->m_next)
+	insn->verify ();
+    }
+
+#endif
+
+  hsa_regalloc ();
+  hsa_brig_emit_function ();
+
+ fail:
+  hsa_deinit_data_for_cfun ();
+}
+
+namespace {
+
+const pass_data pass_data_gen_hsail =
+{
+  GIMPLE_PASS,
+  "hsagen",	 			/* name */
+  OPTGROUP_NONE,                        /* optinfo_flags */
+  TV_NONE,				/* tv_id */
+  PROP_cfg | PROP_ssa,                  /* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  0					/* todo_flags_finish */
+};
+
+class pass_gen_hsail : public gimple_opt_pass
+{
+public:
+  pass_gen_hsail (gcc::context *ctxt)
+    : gimple_opt_pass(pass_data_gen_hsail, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *);
+  unsigned int execute (function *);
+
+}; // class pass_gen_hsail
+
+/* Determine whether or not to run generation of HSAIL.  */
+
+bool
+pass_gen_hsail::gate (function *f)
+{
+  return hsa_gen_requested_p ()
+    && hsa_gpu_implementation_p (f->decl);
+}
+
+unsigned int
+pass_gen_hsail::execute (function *)
+{
+  hsa_function_summary *s = hsa_summaries->get
+    (cgraph_node::get_create (current_function_decl));
+
+  convert_switch_statements ();
+  generate_hsa (s->m_kind == HSA_KERNEL);
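+  /* The HSAIL has been emitted, so the gimple representation of the
+     function is no longer needed and can be discarded.  */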
+  TREE_ASM_WRITTEN (current_function_decl) = 1;
+  return TODO_discard_function;
+}
+
+} // anon namespace
+
+/* Create the instance of hsa gen pass.  */
+
+gimple_opt_pass *
+make_pass_gen_hsail (gcc::context *ctxt)
+{
+  return new pass_gen_hsail (ctxt);
+}
diff --git a/gcc/hsa.c b/gcc/hsa.c
new file mode 100644
index 0000000..ab05a1d
--- /dev/null
+++ b/gcc/hsa.c
@@ -0,0 +1,735 @@
+/* Implementation of commonly needed HSAIL related functions and methods.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Martin Jambor <mjambor@suse.cz> and
+   Martin Liska <mliska@suse.cz>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "hash-set.h"
+#include "hash-map.h"
+#include "vec.h"
+#include "tree.h"
+#include "dumpfile.h"
+#include "gimple-pretty-print.h"
+#include "diagnostic-core.h"
+#include "alloc-pool.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+/* Structure containing intermediate HSA representation of the generated
+   function. */
+class hsa_function_representation *hsa_cfun;
+
+/* Element of the mapping vector between a host decl and an HSA kernel.  */
+
+struct GTY(()) hsa_decl_kernel_map_element
+{
+  /* The decl of the host function.  */
+  tree decl;
+  /* Name of the HSA kernel in BRIG.  */
+  char * GTY((skip)) name;
+  /* Size of OMP data, if the kernel contains a kernel dispatch.  */
+  unsigned omp_data_size;
+};
+
+/* Mapping between decls and corresponding HSA kernels in this compilation
+   unit.  */
+
+static GTY (()) vec<hsa_decl_kernel_map_element, va_gc> *hsa_decl_kernel_mapping;
+
+/* Mapping between decls and corresponding HSA kernels
+   called by the function.  */
+hash_map <tree, vec <const char *> *> *hsa_decl_kernel_dependencies;
+
+/* Hash function to lookup a symbol for a decl.  */
+hash_table <hsa_free_symbol_hasher> *hsa_global_variable_symbols;
+
+/* HSA summaries.  */
+hsa_summary_t *hsa_summaries = NULL;
+
+/* HSA number of threads.  */
+hsa_symbol *hsa_num_threads = NULL;
+
+/* HSA functions that cannot be expanded to HSAIL.  */
+hash_set <tree> *hsa_failed_functions = NULL;
+
+/* True if compilation unit-wide data are already allocated and initialized.  */
+static bool compilation_unit_data_initialized;
+
+/* Return true if FNDECL represents an HSA-callable function.  */
+
+bool
+hsa_callable_function_p (tree fndecl)
+{
+  return lookup_attribute ("omp declare target", DECL_ATTRIBUTES (fndecl));
+}
+
+/* Allocate HSA structures that are used when dealing with different
+   functions.  */
+
+void
+hsa_init_compilation_unit_data (void)
+{
+  if (compilation_unit_data_initialized)
+    return;
+
+  compilation_unit_data_initialized = true;
+
+  hsa_failed_functions = new hash_set <tree> ();
+}
+
+/* Free data structures that are used when dealing with different
+   functions.  */
+
+void
+hsa_deinit_compilation_unit_data (void)
+{
+  if (hsa_failed_functions)
+    delete hsa_failed_functions;
+
+  if (hsa_num_threads)
+    {
+      delete hsa_num_threads;
+      hsa_num_threads = NULL;
+    }
+
+  compilation_unit_data_initialized = false;
+}
+
+/* Return true if we are generating the large HSA machine model.  */
+
+bool
+hsa_machine_large_p (void)
+{
+  /* FIXME: I suppose this is technically wrong but should work for me now.  */
+  return (GET_MODE_BITSIZE (Pmode) == 64);
+}
+
+/* Return true if we are generating code for the full HSA profile.  */
+
+bool
+hsa_full_profile_p (void)
+{
+  return true;
+}
+
+/* Return true if the register in operand number OPNUM of this instruction
+   is an output.  False if it is an input.  */
+
+bool
+hsa_insn_basic::op_output_p (unsigned opnum)
+{
+  switch (m_opcode)
+    {
+    case HSA_OPCODE_PHI:
+    case BRIG_OPCODE_CBR:
+    case BRIG_OPCODE_SBR:
+    case BRIG_OPCODE_ST:
+    case BRIG_OPCODE_SIGNALNORET:
+      /* FIXME: There are probably missing cases here, double check.  */
+      return false;
+    case BRIG_OPCODE_EXPAND:
+      /* Example: expand_v4_b32_b128 (dest0, dest1, dest2, dest3), src0.  */
+      return opnum < operand_count () - 1;
+    default:
+     return opnum == 0;
+    }
+}
+
+/* Return true if OPCODE is a floating-point bit instruction opcode.  */
+
+bool
+hsa_opcode_floating_bit_insn_p (BrigOpcode16_t opcode)
+{
+  switch (opcode)
+    {
+    case BRIG_OPCODE_NEG:
+    case BRIG_OPCODE_ABS:
+    case BRIG_OPCODE_CLASS:
+    case BRIG_OPCODE_COPYSIGN:
+      return true;
+    default:
+      return false;
+    }
+}
+
+/* Return the number of destination operands for this INSN.  */
+
+unsigned
+hsa_insn_basic::input_count ()
+{
+  switch (m_opcode)
+    {
+      default:
+	return 1;
+
+      case BRIG_OPCODE_NOP:
+	return 0;
+
+      case BRIG_OPCODE_EXPAND:
+	return 2;
+
+      case BRIG_OPCODE_LD:
+	/* ld_v[234] not yet handled.  */
+	return 1;
+
+      case BRIG_OPCODE_ST:
+	return 0;
+
+      case BRIG_OPCODE_ATOMICNORET:
+	return 0;
+
+      case BRIG_OPCODE_SIGNAL:
+	return 1;
+
+      case BRIG_OPCODE_SIGNALNORET:
+	return 0;
+
+      case BRIG_OPCODE_MEMFENCE:
+	return 0;
+
+      case BRIG_OPCODE_RDIMAGE:
+      case BRIG_OPCODE_LDIMAGE:
+      case BRIG_OPCODE_STIMAGE:
+      case BRIG_OPCODE_QUERYIMAGE:
+      case BRIG_OPCODE_QUERYSAMPLER:
+	sorry ("HSA image ops not handled");
+	return 0;
+
+      case BRIG_OPCODE_CBR:
+      case BRIG_OPCODE_BR:
+	return 0;
+
+      case BRIG_OPCODE_SBR:
+	return 0; /* ??? */
+
+      case BRIG_OPCODE_WAVEBARRIER:
+	return 0; /* ??? */
+
+      case BRIG_OPCODE_BARRIER:
+      case BRIG_OPCODE_ARRIVEFBAR:
+      case BRIG_OPCODE_INITFBAR:
+      case BRIG_OPCODE_JOINFBAR:
+      case BRIG_OPCODE_LEAVEFBAR:
+      case BRIG_OPCODE_RELEASEFBAR:
+      case BRIG_OPCODE_WAITFBAR:
+	return 0;
+
+      case BRIG_OPCODE_LDF:
+	return 1;
+
+      case BRIG_OPCODE_ACTIVELANECOUNT:
+      case BRIG_OPCODE_ACTIVELANEID:
+      case BRIG_OPCODE_ACTIVELANEMASK:
+      case BRIG_OPCODE_ACTIVELANEPERMUTE:
+	return 1; /* ??? */
+
+      case BRIG_OPCODE_CALL:
+      case BRIG_OPCODE_SCALL:
+      case BRIG_OPCODE_ICALL:
+	return 0;
+
+      case BRIG_OPCODE_RET:
+	return 0;
+
+      case BRIG_OPCODE_ALLOCA:
+	return 1;
+
+      case BRIG_OPCODE_CLEARDETECTEXCEPT:
+	return 0;
+
+      case BRIG_OPCODE_SETDETECTEXCEPT:
+	return 0;
+
+      case BRIG_OPCODE_PACKETCOMPLETIONSIG:
+      case BRIG_OPCODE_PACKETID:
+      case BRIG_OPCODE_CASQUEUEWRITEINDEX:
+      case BRIG_OPCODE_LDQUEUEREADINDEX:
+      case BRIG_OPCODE_LDQUEUEWRITEINDEX:
+      case BRIG_OPCODE_STQUEUEREADINDEX:
+      case BRIG_OPCODE_STQUEUEWRITEINDEX:
+	return 1; /* ??? */
+
+      case BRIG_OPCODE_ADDQUEUEWRITEINDEX:
+	return 1;
+
+      case BRIG_OPCODE_DEBUGTRAP:
+	return 0;
+
+      case BRIG_OPCODE_GROUPBASEPTR:
+      case BRIG_OPCODE_KERNARGBASEPTR:
+	return 1; /* ??? */
+
+      case HSA_OPCODE_ARG_BLOCK:
+	return 0;
+
+      case BRIG_KIND_DIRECTIVE_COMMENT:
+	return 0;
+    }
+}
+
+/* Return the number of source operands for this INSN.  */
+
+unsigned
+hsa_insn_basic::num_used_ops ()
+{
+  gcc_checking_assert (input_count () <= operand_count ());
+
+  return operand_count () - input_count ();
+}
+
+/* Set alignment to VALUE.  */
+
+void
+hsa_insn_mem::set_align (BrigAlignment8_t value)
+{
+  /* TODO: Perhaps remove this dump later on:  */
+  if (dump_file && (dump_flags & TDF_DETAILS) && value < m_align)
+    {
+      fprintf (dump_file, "Decreasing alignment to %u in instruction ", value);
+      dump_hsa_insn (dump_file, this);
+    }
+  m_align = value;
+}
+
+/* Return size of HSA type T in bits.  */
+
+unsigned
+hsa_type_bit_size (BrigType16_t t)
+{
+  switch (t)
+    {
+    case BRIG_TYPE_B1:
+      return 1;
+
+    case BRIG_TYPE_U8:
+    case BRIG_TYPE_S8:
+    case BRIG_TYPE_B8:
+      return 8;
+
+    case BRIG_TYPE_U16:
+    case BRIG_TYPE_S16:
+    case BRIG_TYPE_B16:
+    case BRIG_TYPE_F16:
+      return 16;
+
+    case BRIG_TYPE_U32:
+    case BRIG_TYPE_S32:
+    case BRIG_TYPE_B32:
+    case BRIG_TYPE_F32:
+    case BRIG_TYPE_U8X4:
+    case BRIG_TYPE_U16X2:
+    case BRIG_TYPE_S8X4:
+    case BRIG_TYPE_S16X2:
+    case BRIG_TYPE_F16X2:
+      return 32;
+
+    case BRIG_TYPE_U64:
+    case BRIG_TYPE_S64:
+    case BRIG_TYPE_F64:
+    case BRIG_TYPE_B64:
+    case BRIG_TYPE_U8X8:
+    case BRIG_TYPE_U16X4:
+    case BRIG_TYPE_U32X2:
+    case BRIG_TYPE_S8X8:
+    case BRIG_TYPE_S16X4:
+    case BRIG_TYPE_S32X2:
+    case BRIG_TYPE_F16X4:
+    case BRIG_TYPE_F32X2:
+      return 64;
+
+    case BRIG_TYPE_B128:
+    case BRIG_TYPE_U8X16:
+    case BRIG_TYPE_U16X8:
+    case BRIG_TYPE_U32X4:
+    case BRIG_TYPE_U64X2:
+    case BRIG_TYPE_S8X16:
+    case BRIG_TYPE_S16X8:
+    case BRIG_TYPE_S32X4:
+    case BRIG_TYPE_S64X2:
+    case BRIG_TYPE_F16X8:
+    case BRIG_TYPE_F32X4:
+    case BRIG_TYPE_F64X2:
+      return 128;
+
+    default:
+      gcc_assert (hsa_seen_error ());
+      return t;
+    }
+}
+
+/* Return BRIG bit-type with BITSIZE length.  */
+
+BrigType16_t
+hsa_bittype_for_bitsize (unsigned bitsize)
+{
+  switch (bitsize)
+    {
+    case 1:
+      return BRIG_TYPE_B1;
+    case 8:
+      return BRIG_TYPE_B8;
+    case 16:
+      return BRIG_TYPE_B16;
+    case 32:
+      return BRIG_TYPE_B32;
+    case 64:
+      return BRIG_TYPE_B64;
+    case 128:
+      return BRIG_TYPE_B128;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Return BRIG unsigned int type with BITSIZE length.  */
+
+BrigType16_t
+hsa_uint_for_bitsize (unsigned bitsize)
+{
+  switch (bitsize)
+    {
+    case 8:
+      return BRIG_TYPE_U8;
+    case 16:
+      return BRIG_TYPE_U16;
+    case 32:
+      return BRIG_TYPE_U32;
+    case 64:
+      return BRIG_TYPE_U64;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Return HSA bit-type with the same size as the type T.  */
+
+BrigType16_t
+hsa_bittype_for_type (BrigType16_t t)
+{
+  return hsa_bittype_for_bitsize (hsa_type_bit_size (t));
+}
+
+/* Return true if and only if TYPE is a floating point number type.  */
+
+bool
+hsa_type_float_p (BrigType16_t type)
+{
+  switch (type & BRIG_TYPE_BASE_MASK)
+    {
+    case BRIG_TYPE_F16:
+    case BRIG_TYPE_F32:
+    case BRIG_TYPE_F64:
+      return true;
+    default:
+      return false;
+    }
+}
+
+/* Return true if and only if TYPE is an integer number type.  */
+
+bool
+hsa_type_integer_p (BrigType16_t type)
+{
+  switch (type & BRIG_TYPE_BASE_MASK)
+    {
+    case BRIG_TYPE_U8:
+    case BRIG_TYPE_U16:
+    case BRIG_TYPE_U32:
+    case BRIG_TYPE_U64:
+    case BRIG_TYPE_S8:
+    case BRIG_TYPE_S16:
+    case BRIG_TYPE_S32:
+    case BRIG_TYPE_S64:
+      return true;
+    default:
+      return false;
+    }
+}
+
+/* Return true if and only if TYPE is a bit-type.  */
+
+bool
+hsa_btype_p (BrigType16_t type)
+{
+  switch (type & BRIG_TYPE_BASE_MASK)
+    {
+    case BRIG_TYPE_B8:
+    case BRIG_TYPE_B16:
+    case BRIG_TYPE_B32:
+    case BRIG_TYPE_B64:
+    case BRIG_TYPE_B128:
+      return true;
+    default:
+      return false;
+    }
+}
+
+
+/* Return the HSA alignment encoding for an alignment of N bits.  */
+
+BrigAlignment8_t
+hsa_alignment_encoding (unsigned n)
+{
+  gcc_assert (n >= 8 && !(n & (n - 1)));
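+  /* The encoding is in bytes, and 32 bytes (256 bits) is the largest
+     alignment this function ever emits, so clamp bigger requests.  */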
+  if (n >= 256)
+    return BRIG_ALIGNMENT_32;
+
+  switch (n)
+    {
+    case 8:
+      return BRIG_ALIGNMENT_1;
+    case 16:
+      return BRIG_ALIGNMENT_2;
+    case 32:
+      return BRIG_ALIGNMENT_4;
+    case 64:
+      return BRIG_ALIGNMENT_8;
+    case 128:
+      return BRIG_ALIGNMENT_16;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Return natural alignment of HSA TYPE.  */
+
+BrigAlignment8_t
+hsa_natural_alignment (BrigType16_t type)
+{
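+  /* Masking out the array bit means that the natural alignment of an array
+     is that of its element type.  */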
+  return hsa_alignment_encoding (hsa_type_bit_size (type & ~BRIG_TYPE_ARRAY));
+}
+
+/* Call the correct destructor of an HSA instruction.  */
+
+void
+hsa_destroy_insn (hsa_insn_basic *insn)
+{
+  if (hsa_insn_phi *phi = dyn_cast <hsa_insn_phi *> (insn))
+    phi->~hsa_insn_phi ();
+  else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
+    br->~hsa_insn_br ();
+  else if (hsa_insn_cmp *cmp = dyn_cast <hsa_insn_cmp *> (insn))
+    cmp->~hsa_insn_cmp ();
+  else if (hsa_insn_mem *mem = dyn_cast <hsa_insn_mem *> (insn))
+    mem->~hsa_insn_mem ();
+  else if (hsa_insn_atomic *atomic = dyn_cast <hsa_insn_atomic *> (insn))
+    atomic->~hsa_insn_atomic ();
+  else if (hsa_insn_seg *seg = dyn_cast <hsa_insn_seg *> (insn))
+    seg->~hsa_insn_seg ();
+  else if (hsa_insn_call *call = dyn_cast <hsa_insn_call *> (insn))
+    call->~hsa_insn_call ();
+  else if (hsa_insn_arg_block *block = dyn_cast <hsa_insn_arg_block *> (insn))
+    block->~hsa_insn_arg_block ();
+  else if (hsa_insn_sbr *sbr = dyn_cast <hsa_insn_sbr *> (insn))
+    sbr->~hsa_insn_sbr ();
+  else if (hsa_insn_comment *comment = dyn_cast <hsa_insn_comment *> (insn))
+    comment->~hsa_insn_comment ();
+  else
+    insn->~hsa_insn_basic ();
+}
+
+/* Call the correct destructor of an HSA operand.  */
+
+void
+hsa_destroy_operand (hsa_op_base *op)
+{
+  if (hsa_op_code_list *list = dyn_cast <hsa_op_code_list *> (op))
+    list->~hsa_op_code_list ();
+  else if (hsa_op_operand_list *list = dyn_cast <hsa_op_operand_list *> (op))
+    list->~hsa_op_operand_list ();
+  else if (hsa_op_reg *reg = dyn_cast <hsa_op_reg *> (op))
+    reg->~hsa_op_reg ();
+  else if (hsa_op_immed *immed = dyn_cast <hsa_op_immed *> (op))
+    immed->~hsa_op_immed ();
+  else
+    op->~hsa_op_base ();
+}
+
+/* Create a mapping between the original function DECL and kernel name NAME.  */
+
+void
+hsa_add_kern_decl_mapping (tree decl, char *name, unsigned omp_data_size)
+{
+  hsa_decl_kernel_map_element dkm;
+  dkm.decl = decl;
+  dkm.name = name;
+  dkm.omp_data_size = omp_data_size;
+  vec_safe_push (hsa_decl_kernel_mapping, dkm);
+}
+
+/* Return the number of kernel decl name mappings.  */
+
+unsigned
+hsa_get_number_decl_kernel_mappings (void)
+{
+  return vec_safe_length (hsa_decl_kernel_mapping);
+}
+
+/* Return the decl in the Ith kernel decl name mapping.  */
+
+tree
+hsa_get_decl_kernel_mapping_decl (unsigned i)
+{
+  return (*hsa_decl_kernel_mapping)[i].decl;
+}
+
+/* Return the name in the Ith kernel decl name mapping.  */
+
+char *
+hsa_get_decl_kernel_mapping_name (unsigned i)
+{
+  return (*hsa_decl_kernel_mapping)[i].name;
+}
+
+/* Return the OMP data size of the Ith kernel decl name mapping.  */
+
+unsigned
+hsa_get_decl_kernel_mapping_omp_size (unsigned i)
+{
+  return (*hsa_decl_kernel_mapping)[i].omp_data_size;
+}
+
+/* Free the mapping between original decls and kernel names.  */
+
+void
+hsa_free_decl_kernel_mapping (void)
+{
+  if (hsa_decl_kernel_mapping == NULL)
+    return;
+
+  for (unsigned i = 0; i < hsa_decl_kernel_mapping->length (); ++i)
+    free ((*hsa_decl_kernel_mapping)[i].name);
+  ggc_free (hsa_decl_kernel_mapping);
+}
+
+/* Add new kernel dependency.  */
+
+void
+hsa_add_kernel_dependency (tree caller, const char *called_function)
+{
+  if (hsa_decl_kernel_dependencies == NULL)
+    hsa_decl_kernel_dependencies = new hash_map<tree, vec<const char *> *> ();
+
+  vec <const char *> *s = NULL;
+  vec <const char *> **slot = hsa_decl_kernel_dependencies->get (caller);
+  if (slot == NULL)
+    {
+      s = new vec <const char *> ();
+      hsa_decl_kernel_dependencies->put (caller, s);
+    }
+  else
+    s = *slot;
+
+  s->safe_push (called_function);
+}
+
+/* Modify the name P in-place so that it is a valid HSA identifier.  */
+
+void
+hsa_sanitize_name (char *p)
+{
+  for (; *p; p++)
+    if (*p == '.' || *p == '-')
+      *p = '_';
+}
+
+/* Clone the name P, prepend an ampersand to it and sanitize the result.  */
+
+char *
+hsa_brig_function_name (const char *p)
+{
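+  /* E.g. a function "foo.omp_fn.0" (a made-up name, for illustration) will
+     be emitted as "&foo_omp_fn_0"; the ampersand denotes module-scope
+     identifiers in HSAIL.  */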
+  unsigned len = strlen (p);
+  char *buf = XNEWVEC (char, len + 2);
+
+  buf[0] = '&';
+  buf[len + 1] = '\0';
+  memcpy (buf + 1, p, len);
+
+  hsa_sanitize_name (buf);
+  return buf;
+}
+
+/* Return the name of DECL, creating an artificial one if the declaration
+   does not have any.  */
+
+const char *
+hsa_get_declaration_name (tree decl)
+{
+  if (!DECL_NAME (decl))
+    {
+      char *b = XNEWVEC (char, 64);
+      sprintf (b, "__hsa_anonymous_%i", DECL_UID (decl));
+      const char *ggc_str = ggc_alloc_string (b, strlen (b) + 1);
+      free (b);
+      return ggc_str;
+    }
+  else if (TREE_CODE (decl) == FUNCTION_DECL)
+    return cgraph_node::get_create (decl)->asm_name ();
+  else
+    return IDENTIFIER_POINTER (DECL_NAME (decl));
+}
+
+/* Add a HOST function to HSA summaries.  */
+
+void
+hsa_register_kernel (cgraph_node *host)
+{
+  if (hsa_summaries == NULL)
+    hsa_summaries = new hsa_summary_t (symtab);
+  hsa_function_summary *s = hsa_summaries->get (host);
+  s->m_kind = HSA_KERNEL;
+}
+
+/* Add a pair of functions to HSA summaries.  GPU is an HSA implementation of
+   a HOST function.  */
+
+void
+hsa_register_kernel (cgraph_node *gpu, cgraph_node *host)
+{
+  if (hsa_summaries == NULL)
+    hsa_summaries = new hsa_summary_t (symtab);
+  hsa_summaries->link_functions (gpu, host, HSA_KERNEL);
+}
+
+/* Return true if expansion of the current HSA function has already failed.  */
+
+bool
+hsa_seen_error (void)
+{
+  return hsa_cfun->m_seen_error;
+}
+
+/* Mark current HSA function as failed.  */
+
+void
+hsa_fail_cfun (void)
+{
+  hsa_failed_functions->add (hsa_cfun->m_decl);
+  hsa_cfun->m_seen_error = true;
+}
+
+#include "gt-hsa.h"
diff --git a/gcc/hsa.h b/gcc/hsa.h
new file mode 100644
index 0000000..025de67
--- /dev/null
+++ b/gcc/hsa.h
@@ -0,0 +1,1274 @@
+/* HSAIL and BRIG related macros and definitions.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef HSA_H
+#define HSA_H
+
+#include "hsa-brig-format.h"
+#include "is-a.h"
+#include "predict.h"
+#include "tree.h"
+#include "vec.h"
+#include "hash-table.h"
+#include "basic-block.h"
+
+
+/* Return true if the compiler should produce HSAIL.  */
+
+static inline bool
+hsa_gen_requested_p (void)
+{
+#ifndef ENABLE_HSA
+  return false;
+#endif
+  return !flag_disable_hsa;
+}
+
+/* Standard warning message if we failed to generate HSAIL for a function.  */
+
+#define HSA_SORRY_MSG "could not emit HSAIL for the function"
+
+class hsa_op_immed;
+class hsa_op_cst_list;
+class hsa_insn_basic;
+class hsa_op_address;
+class hsa_op_reg;
+class hsa_bb;
+typedef hsa_insn_basic *hsa_insn_basic_p;
+
+/* Class representing an input argument, an output argument (result) or a
+   variable that will eventually end up being a symbol directive.  */
+
+struct hsa_symbol
+{
+  /* Constructor.  */
+  hsa_symbol (BrigType16_t type, BrigSegment8_t segment,
+	      BrigLinkage8_t linkage);
+
+  /* Return total size of the symbol.  */
+  unsigned HOST_WIDE_INT total_byte_size ();
+
+  /* Fill in the values of the symbol that are determined from DECL
+     independently of whether it is a parameter, a result, or a variable,
+     local or global.  */
+  void fillup_for_decl (tree decl);
+
+  /* Pointer to the original tree, which is PARM_DECL for input parameters and
+     RESULT_DECL for the output parameters.  */
+  tree m_decl;
+
+  /* Name of the symbol, which will be written into the output and dumps.
+     Can be NULL; see m_name_number below.  */
+  const char *m_name;
+
+  /* If name is NULL, artificial name will be formed from the segment name and
+     this number.  */
+  int m_name_number;
+
+  /* Once written, this is the offset of the associated symbol directive.  Zero
+     means the symbol has not been written yet.  */
+  unsigned m_directive_offset;
+
+  /* HSA type of the parameter.  */
+  BrigType16_t m_type;
+
+  /* The HSA segment this will eventually end up in.  */
+  BrigSegment8_t m_segment;
+
+  /* The HSA kind of linkage.  */
+  BrigLinkage8_t m_linkage;
+
+  /* Array dimension, if non-zero.  */
+  unsigned HOST_WIDE_INT m_dim;
+
+  /* Constant value, used for string constants.  */
+  hsa_op_immed *m_cst_value;
+
+  /* Is in global scope.  */
+  bool m_global_scope_p;
+
+  /* True if an error has been seen for the symbol.  */
+  bool m_seen_error;
+
+private:
+  /* Default constructor.  */
+  hsa_symbol ();
+};
+
+/* Abstract class for HSA instruction operands. */
+
+class hsa_op_base
+{
+public:
+  /* Next operand scheduled to be written when writing BRIG operand
+     section.  */
+  hsa_op_base *m_next;
+
+  /* Offset to which the associated operand structure will be written.  Zero
+     if not yet scheduled for writing.  */
+  unsigned m_brig_op_offset;
+
+  /* The type of a particular operand.  */
+  BrigKind16_t m_kind;
+
+protected:
+  hsa_op_base (BrigKind16_t k);
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_base () {}
+};
+
+/* Common abstract ancestor for operands which have a type.  */
+
+class hsa_op_with_type : public hsa_op_base
+{
+public:
+  /* The type.  */
+  BrigType16_t m_type;
+
+  /* Convert an operand to a destination type DTYPE and attach insns
+     to HBB if needed.  */
+  hsa_op_with_type *get_in_type (BrigType16_t dtype, hsa_bb *hbb);
+
+protected:
+  hsa_op_with_type (BrigKind16_t k, BrigType16_t t);
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_with_type () : hsa_op_base (BRIG_KIND_NONE) {}
+};
+
+/* An immediate HSA operand.  */
+
+class hsa_op_immed : public hsa_op_with_type
+{
+public:
+  hsa_op_immed (tree tree_val, bool min32int = true);
+  hsa_op_immed (HOST_WIDE_INT int_value, BrigKind16_t type);
+  void *operator new (size_t);
+  ~hsa_op_immed ();
+  void set_type (BrigKind16_t t);
+
+  /* Value as represented by middle end.  */
+  tree m_tree_value;
+
+  /* Integer value representation.  */
+  HOST_WIDE_INT m_int_value;
+
+  /* Brig data representation.  */
+  char *m_brig_repr;
+
+  /* Brig data representation size in bytes.  */
+  unsigned m_brig_repr_size;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_immed ();
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+  void emit_to_buffer (tree value);
+};
+
+/* Report whether or not P is an immediate operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_immed *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_CONSTANT_BYTES;
+}
+
+/* HSA register operand.  */
+
+class hsa_op_reg : public hsa_op_with_type
+{
+  friend class hsa_insn_basic;
+  friend class hsa_insn_phi;
+public:
+  hsa_op_reg (BrigType16_t t);
+  void *operator new (size_t);
+
+  /* Verify register operand.  */
+  void verify_ssa ();
+
+  /* If NON-NULL, gimple SSA that we come from.  NULL if none.  */
+  tree m_gimple_ssa;
+
+  /* Defining instruction while still in the SSA.  */
+  hsa_insn_basic *m_def_insn;
+
+  /* If the register allocator decides to spill the register, this is the
+     appropriate spill symbol.  */
+  hsa_symbol *m_spill_sym;
+
+  /* Number of this register structure in the order in which they were
+     allocated.  */
+  int m_order;
+  int m_lr_begin, m_lr_end;
+
+  /* Zero if the register is not yet allocated.  After allocation, this must
+     be 'c', 's', 'd' or 'q'.  */
+  char m_reg_class;
+  /* If allocated, the number of the HW register (within its HSA register
+     class). */
+  char m_hard_num;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_reg () : hsa_op_with_type (BRIG_KIND_NONE, BRIG_TYPE_NONE) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+  /* Set definition where the register is defined.  */
+  void set_definition (hsa_insn_basic *insn);
+  /* Uses of the value while still in SSA.  */
+  auto_vec <hsa_insn_basic_p> m_uses;
+};
+
+typedef class hsa_op_reg *hsa_op_reg_p;
+
+/* Report whether or not P is a register operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_reg *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_REGISTER;
+}
+
+/* Report whether or not P is a register operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_reg *>::test (hsa_op_with_type *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_REGISTER;
+}
+
+/* An address HSA operand.  */
+
+class hsa_op_address : public hsa_op_base
+{
+public:
+  /* Set up a new address operand consisting of base symbol SYM, register REG
+     and immediate OFFSET.  If the machine model is not large and the offset
+     is 64 bit, the upper 32 bits have to be zero.  */
+  hsa_op_address (hsa_symbol *sym, hsa_op_reg *reg,
+		  HOST_WIDE_INT offset = 0);
+
+  void *operator new (size_t);
+
+  /* Set up a new address operand consisting of base symbol SYM and
+     immediate OFFSET.  If the machine model is not large and the offset is
+     64 bit, the upper 32 bits have to be zero.  */
+  hsa_op_address (hsa_symbol *sym, HOST_WIDE_INT offset = 0);
+
+  /* Set up a new address operand consisting of register R and
+     immediate OFFSET.  If the machine model is not large and the offset is
+     64 bit, the upper 32 bits have to be zero.  */
+  hsa_op_address (hsa_op_reg *reg, HOST_WIDE_INT offset = 0);
+
+  /* Symbol base of the address.  Can be NULL if there is none.  */
+  hsa_symbol *m_symbol;
+
+  /* Register offset.  Can be NULL if there is none.  */
+  hsa_op_reg *m_reg;
+
+  /* Immediate byte offset.  */
+  HOST_WIDE_INT m_imm_offset;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_address () : hsa_op_base (BRIG_KIND_NONE) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is an address operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_address *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_ADDRESS;
+}
+
+/* A reference-to-code HSA operand.  It can be a reference either to the
+   start of a BB or to the start of a function.  */
+
+class hsa_op_code_ref : public hsa_op_base
+{
+public:
+  hsa_op_code_ref ();
+
+  /* Offset in the code section that this refers to.  */
+  unsigned m_directive_offset;
+};
+
+/* Report whether or not P is a code reference operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_code_ref *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_CODE_REF;
+}
+
+/* Code list HSA operand.  */
+
+class hsa_op_code_list: public hsa_op_base
+{
+public:
+  hsa_op_code_list (unsigned elements);
+  void *operator new (size_t);
+
+  /* Offset to a variable-sized array in the hsa_data section that holds
+     offsets to entries in the hsa_code section.  */
+  auto_vec<unsigned> m_offsets;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_code_list () : hsa_op_base (BRIG_KIND_NONE) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a code list operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_code_list *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_CODE_LIST;
+}
+
+/* Operand list HSA operand.  */
+
+class hsa_op_operand_list: public hsa_op_base
+{
+public:
+  hsa_op_operand_list (unsigned elements);
+  ~hsa_op_operand_list ();
+  void *operator new (size_t);
+
+  /* Offset to a variable-sized array in the hsa_data section that holds
+     offsets to entries in the hsa_code section.  */
+  auto_vec<unsigned> m_offsets;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_op_operand_list () : hsa_op_base (BRIG_KIND_NONE) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is an operand list operand.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_op_operand_list *>::test (hsa_op_base *p)
+{
+  return p->m_kind == BRIG_KIND_OPERAND_OPERAND_LIST;
+}
+
+/* Opcodes of instructions that are not part of HSA but that we nevertheless
+   use in our internal representation.  */
+
+#define HSA_OPCODE_PHI (-1)
+#define HSA_OPCODE_ARG_BLOCK (-2)
+
+/* The number of operand pointers we can store directly in an instruction.  */
+#define HSA_BRIG_INT_STORAGE_OPERANDS 5
+
+/* Class representing an HSA instruction.  Unlike typical ancestors for
+   specialized classes, this one is also directly used for all instructions
+   that are then represented as BrigInstBasic.  */
+
+class hsa_insn_basic
+{
+public:
+  hsa_insn_basic (unsigned nops, int opc);
+  hsa_insn_basic (unsigned nops, int opc, BrigType16_t t,
+		  hsa_op_base *arg0 = NULL,
+		  hsa_op_base *arg1 = NULL,
+		  hsa_op_base *arg2 = NULL,
+		  hsa_op_base *arg3 = NULL);
+
+  void *operator new (size_t);
+  void set_op (int index, hsa_op_base *op);
+  hsa_op_base *get_op (int index);
+  hsa_op_base **get_op_addr (int index);
+  unsigned int operand_count ();
+  void verify ();
+  unsigned input_count ();
+  unsigned num_used_ops ();
+  void set_output_in_type (hsa_op_reg *dest, unsigned op_index, hsa_bb *hbb);
+  bool op_output_p (unsigned opnum);
+
+  /* The previous and next instruction in the basic block.  */
+  hsa_insn_basic *m_prev, *m_next;
+
+  /* Basic block this instruction belongs to.  */
+  basic_block m_bb;
+
+  /* Opcode distinguishing different types of instructions.  Eventually these
+     should only be BRIG_OPCODE_* values from the BrigOpcode16_t range but
+     initially we use negative values for PHI nodes and such.  */
+  int m_opcode;
+
+  /* Linearized number assigned to the instruction by HSA RA.  */
+  int m_number;
+
+  /* Type of the destination of the operations.  */
+  BrigType16_t m_type;
+
+  /* BRIG offset of the instruction in code section.  */
+  unsigned int m_brig_offset;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_basic () {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+  /* The individual operands.  All instructions but PHI nodes have five or
+     fewer operands and so will fit the internal storage.  */
+  /* TODO: Vast majority of instructions have three or fewer operands, so we
+     may actually try reducing it.  */
+  auto_vec<hsa_op_base *, HSA_BRIG_INT_STORAGE_OPERANDS> m_operands;
+};
+
+/* Class representing a PHI node of the SSA form of HSA virtual
+   registers.  */
+
+class hsa_insn_phi : public hsa_insn_basic
+{
+public:
+  hsa_insn_phi (unsigned nops, hsa_op_reg *dst);
+
+  void *operator new (size_t);
+
+  /* Destination.  */
+  hsa_op_reg *m_dest;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_phi () : hsa_insn_basic (1, HSA_OPCODE_PHI) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a PHI node.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_phi *>::test (hsa_insn_basic *p)
+{
+  return p->m_opcode == HSA_OPCODE_PHI;
+}
+
+/* HSA instruction for branches.  Currently we explicitly represent only
+   conditional branches.  */
+
+class hsa_insn_br : public hsa_insn_basic
+{
+public:
+  hsa_insn_br (hsa_op_reg *ctrl);
+
+  void *operator new (size_t);
+
+  /* Width as described in HSA documentation.  */
+  BrigWidth8_t m_width;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_br () : hsa_insn_basic (1, BRIG_OPCODE_CBR) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether P is a branching instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_br *>::test (hsa_insn_basic *p)
+{
+  return p->m_opcode == BRIG_OPCODE_BR
+    || p->m_opcode == BRIG_OPCODE_CBR;
+}
+
+/* HSA instruction for switch branches.  */
+
+class hsa_insn_sbr : public hsa_insn_basic
+{
+public:
+  hsa_insn_sbr (hsa_op_reg *index, unsigned jump_count);
+
+  /* Default destructor.  */
+  ~hsa_insn_sbr ();
+
+  void *operator new (size_t);
+
+  void replace_all_labels (basic_block old_bb, basic_block new_bb);
+
+  /* Width as described in HSA documentation.  */
+  BrigWidth8_t m_width;
+
+  /* Jump table.  */
+  vec <basic_block> m_jump_table;
+
+  /* Default label basic block.  */
+  basic_block m_default_bb;
+
+  /* Code list for label references.  */
+  hsa_op_code_list *m_label_code_list;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_sbr () : hsa_insn_basic (1, BRIG_OPCODE_SBR) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether P is a switch branching instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_sbr *>::test (hsa_insn_basic *p)
+{
+  return p->m_opcode == BRIG_OPCODE_SBR;
+}
+
+/* HSA instruction for comparisons.  */
+
+class hsa_insn_cmp : public hsa_insn_basic
+{
+public:
+  hsa_insn_cmp (BrigCompareOperation8_t cmp, BrigType16_t t,
+		hsa_op_base *arg0 = NULL, hsa_op_base *arg1 = NULL,
+		hsa_op_base *arg2 = NULL);
+
+  void *operator new (size_t);
+
+  /* Source type should be derived from operand types.  */
+
+  /* The comparison operation.  */
+  BrigCompareOperation8_t m_compare;
+
+  /* TODO: Modifiers and packing control are missing but so are everywhere
+     else.  */
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_cmp () : hsa_insn_basic (1, BRIG_OPCODE_CMP) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a comparison instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_cmp *>::test (hsa_insn_basic *p)
+{
+  return p->m_opcode == BRIG_OPCODE_CMP;
+}
+
+/* HSA instruction for memory operations.  */
+
+class hsa_insn_mem : public hsa_insn_basic
+{
+public:
+  hsa_insn_mem (int opc, BrigType16_t t, hsa_op_base *arg0, hsa_op_base *arg1);
+
+  void *operator new (size_t);
+
+  /* Set alignment to VALUE.  */
+
+  void set_align (BrigAlignment8_t value);
+
+  /* The segment of the memory access is either the segment of the symbol in
+     the address operand or the flat segment if there is no symbol there.  */
+
+  /* Required alignment of the memory operation.  */
+  BrigAlignment8_t m_align;
+
+  /* HSA equiv class, basically an alias set number.  */
+  uint8_t m_equiv_class;
+
+  /* TODO:  Add width modifier, perhaps also other things.  */
+protected:
+  hsa_insn_mem (unsigned nops, int opc, BrigType16_t t,
+		hsa_op_base *arg0 = NULL, hsa_op_base *arg1 = NULL,
+		hsa_op_base *arg2 = NULL, hsa_op_base *arg3 = NULL);
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_mem () : hsa_insn_basic (1, BRIG_OPCODE_LD) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a memory instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_mem *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_LD
+	  || p->m_opcode == BRIG_OPCODE_ST);
+}
+
+/* HSA instruction for atomic operations.  */
+
+class hsa_insn_atomic : public hsa_insn_mem
+{
+public:
+  hsa_insn_atomic (int nops, int opc, enum BrigAtomicOperation aop,
+		   BrigType16_t t, BrigMemoryOrder memorder,
+		   hsa_op_base *arg0 = NULL, hsa_op_base *arg1 = NULL,
+		   hsa_op_base *arg2 = NULL, hsa_op_base *arg3 = NULL);
+  void *operator new (size_t);
+
+  /* The operation itself.  */
+  enum BrigAtomicOperation m_atomicop;
+
+  /* Things like acquire/release/aligned.  */
+  enum BrigMemoryOrder m_memoryorder;
+
+  /* Scope of the atomic operation.  */
+  enum BrigMemoryScope m_memoryscope;
+
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_atomic () : hsa_insn_mem (1, BRIG_KIND_NONE, BRIG_TYPE_NONE) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is an atomic instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_atomic *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_ATOMIC
+	  || p->m_opcode == BRIG_OPCODE_ATOMICNORET);
+}
+
+/* HSA instruction for signal operations.  */
+
+class hsa_insn_signal : public hsa_insn_atomic
+{
+public:
+  hsa_insn_signal (int nops, int opc, enum BrigAtomicOperation sop,
+		   BrigType16_t t, hsa_op_base *arg0 = NULL,
+		   hsa_op_base *arg1 = NULL,
+		   hsa_op_base *arg2 = NULL, hsa_op_base *arg3 = NULL);
+
+  void *operator new (size_t);
+
+private:
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a signal instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_signal *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_SIGNAL
+	  || p->m_opcode == BRIG_OPCODE_SIGNALNORET);
+}
+
+/* HSA instruction to convert between flat addressing and segments.  */
+
+class hsa_insn_seg : public hsa_insn_basic
+{
+public:
+  hsa_insn_seg (int opc, BrigType16_t destt, BrigType16_t srct,
+		BrigSegment8_t seg, hsa_op_base *arg0, hsa_op_base *arg1);
+
+  void *operator new (size_t);
+
+  /* Source type.  Depends on the source addressing/segment.  */
+  BrigType16_t m_src_type;
+  /* The segment we are converting from or to.  */
+  BrigSegment8_t m_segment;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_seg () : hsa_insn_basic (1, BRIG_OPCODE_STOF) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a segment conversion instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_seg *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_STOF
+	  || p->m_opcode == BRIG_OPCODE_FTOS);
+}
+
+/* HSA instruction for function call.  */
+
+class hsa_insn_call : public hsa_insn_basic
+{
+public:
+  hsa_insn_call (tree callee);
+
+  /* Default destructor.  */
+  ~hsa_insn_call ();
+
+  void *operator new (size_t);
+
+  /* Called function.  */
+  tree m_called_function;
+
+  /* Input formal arguments.  */
+  auto_vec <hsa_symbol *> m_input_args;
+
+  /* Input arguments store instructions.  */
+  auto_vec <hsa_insn_mem *> m_input_arg_insns;
+
+  /* Output argument, can be NULL for void functions.  */
+  hsa_symbol *m_output_arg;
+
+  /* Called function code reference.  */
+  hsa_op_code_ref m_func;
+
+  /* Code list for arguments of the function.  */
+  hsa_op_code_list *m_args_code_list;
+
+  /* Code list for result of the function.  */
+  hsa_op_code_list *m_result_code_list;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_insn_call () : hsa_insn_basic (0, BRIG_OPCODE_CALL) {}
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a call instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_call *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_CALL);
+}
+
+/* HSA call instruction block encapsulates definition of arguments,
+   result type, corresponding loads and a possible store.
+   Moreover, it contains a single call instruction.
+   Emission of the instruction will produce multiple
+   HSAIL instructions.  */
+
+class hsa_insn_arg_block : public hsa_insn_basic
+{
+public:
+  hsa_insn_arg_block (BrigKind brig_kind, hsa_insn_call * call);
+
+  void *operator new (size_t);
+
+  /* Kind of argument block.  */
+  BrigKind m_kind;
+
+  /* Call instruction.  */
+  hsa_insn_call *m_call_insn;
+private:
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Report whether or not P is a call block instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_arg_block *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == HSA_OPCODE_ARG_BLOCK);
+}
+
+/* HSA comment instruction.  */
+
+class hsa_insn_comment: public hsa_insn_basic
+{
+public:
+  /* Constructor of class representing the comment in HSAIL.  */
+  hsa_insn_comment (const char *s);
+
+  /* Default destructor.  */
+  ~hsa_insn_comment ();
+
+  void *operator new (size_t);
+
+  char *m_comment;
+};
+
+/* Report whether or not P is a comment instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_comment *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_KIND_DIRECTIVE_COMMENT);
+}
+
+/* HSA queue instruction.  */
+
+class hsa_insn_queue: public hsa_insn_basic
+{
+public:
+  hsa_insn_queue (int nops, BrigOpcode opcode);
+
+  /* Destructor.  */
+  ~hsa_insn_queue ();
+};
+
+/* Report whether or not P is a queue instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_queue *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_ADDQUEUEWRITEINDEX);
+}
+
+/* HSA packed instruction.  */
+
+class hsa_insn_packed : public hsa_insn_basic
+{
+public:
+  hsa_insn_packed (int nops, BrigOpcode opcode, BrigType16_t destt,
+		   BrigType16_t srct, hsa_op_base *arg0, hsa_op_base *arg1,
+		   hsa_op_base *arg2);
+
+  /* Pool allocator.  */
+  void *operator new (size_t);
+
+  /* Source type.  */
+  BrigType16_t m_source_type;
+
+  /* Operand list for an operand of the instruction.  */
+  hsa_op_operand_list *m_operand_list;
+
+  /* Destructor.  */
+  ~hsa_insn_packed ();
+};
+
+/* Report whether or not P is a combine or expand instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_packed *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_COMBINE
+	  || p->m_opcode == BRIG_OPCODE_EXPAND);
+}
+
+/* HSA convert instruction.  */
+
+class hsa_insn_cvt: public hsa_insn_basic
+{
+public:
+  hsa_insn_cvt (hsa_op_with_type *dest, hsa_op_with_type *src);
+
+  /* Pool allocator.  */
+  void *operator new (size_t);
+};
+
+/* Report whether or not P is a convert instruction.  */
+
+template <>
+template <>
+inline bool
+is_a_helper <hsa_insn_cvt *>::test (hsa_insn_basic *p)
+{
+  return (p->m_opcode == BRIG_OPCODE_CVT);
+}
+
+/* Basic block of HSA instructions.  */
+
+class hsa_bb
+{
+public:
+  hsa_bb (basic_block cfg_bb);
+  hsa_bb (basic_block cfg_bb, int idx);
+  ~hsa_bb ();
+
+  /* Append an instruction INSN into the basic block.  */
+  void append_insn (hsa_insn_basic *insn);
+
+  /* The real CFG BB that this HBB belongs to.  */
+  basic_block m_bb;
+
+  /* The operand that refers to the label to this BB.  */
+  hsa_op_code_ref m_label_ref;
+
+  /* The first and last instruction.  */
+  hsa_insn_basic *m_first_insn, *m_last_insn;
+  /* The first and last phi node.  */
+  hsa_insn_phi *m_first_phi, *m_last_phi;
+
+  /* Just a number to construct names from.  */
+  int m_index;
+
+  bitmap m_liveout, m_livein;
+private:
+  /* Make the default constructor inaccessible.  */
+  hsa_bb ();
+  /* All objects are deallocated by destroying their pool, so make delete
+     inaccessible too.  */
+  void operator delete (void *) {}
+};
+
+/* Return the corresponding HSA basic block structure for the given control
+   flow basic_block BB.  */
+
+static inline hsa_bb *
+hsa_bb_for_bb (basic_block bb)
+{
+  return (struct hsa_bb *) bb->aux;
+}
+
+/* Class for hashing local hsa_symbols.  */
+
+struct hsa_noop_symbol_hasher : nofree_ptr_hash <hsa_symbol>
+{
+  static inline hashval_t hash (const value_type);
+  static inline bool equal (const value_type, const compare_type);
+};
+
+/* Hash hsa_symbol.  */
+
+inline hashval_t
+hsa_noop_symbol_hasher::hash (const value_type item)
+{
+  return DECL_UID (item->m_decl);
+}
+
+/* Return true if the DECL_UIDs of the decls that both symbols refer to are
+   equal.  */
+
+inline bool
+hsa_noop_symbol_hasher::equal (const value_type a, const compare_type b)
+{
+  return (DECL_UID (a->m_decl) == DECL_UID (b->m_decl));
+}
+
+/* Class for hashing global hsa_symbols.  */
+
+struct hsa_free_symbol_hasher : free_ptr_hash <hsa_symbol>
+{
+  static inline hashval_t hash (const value_type);
+  static inline bool equal (const value_type, const compare_type);
+};
+
+/* Hash hsa_symbol.  */
+
+inline hashval_t
+hsa_free_symbol_hasher::hash (const value_type item)
+{
+  return DECL_UID (item->m_decl);
+}
+
+/* Return true if the DECL_UIDs of the decls that both symbols refer to are
+   equal.  */
+
+inline bool
+hsa_free_symbol_hasher::equal (const value_type a, const compare_type b)
+{
+  return (DECL_UID (a->m_decl) == DECL_UID (b->m_decl));
+}
+
+/* Structure that encapsulates the intermediate representation of an HSA
+   function.  */
+
+class hsa_function_representation
+{
+public:
+  hsa_function_representation (tree fdecl, bool kernel_p,
+			       unsigned ssa_names_count);
+  ~hsa_function_representation ();
+
+  /* Build a shadow register that is used for a kernel dispatch.  */
+  hsa_op_reg *get_shadow_reg ();
+
+  /* Return true if we are in a function that has kernel dispatch
+     shadow register.  */
+  bool has_shadow_reg_p ();
+
+  /* The entry/exit blocks don't contain incoming code, but the HSA
+     generator might need to put code into them, so we need hsa_bb
+     instances of them.  */
+  void init_extra_bbs ();
+
+  /* Create a private symbol of requested TYPE.  */
+  hsa_symbol *create_hsa_temporary (BrigType16_t type);
+
+  /* Look up or create an HSA pseudo register for a given gimple SSA name.  */
+  hsa_op_reg *reg_for_gimple_ssa (tree ssa);
+
+  /* Name of the function.  */
+  char *m_name;
+
+  /* Number of allocated register structures.  */
+  int m_reg_count;
+
+  /* Input arguments.  */
+  vec <hsa_symbol *> m_input_args;
+
+  /* Output argument or NULL if there is none.  */
+  hsa_symbol *m_output_arg;
+
+  /* Hash table of local variable symbols.  */
+  hash_table <hsa_noop_symbol_hasher> *m_local_symbols;
+
+  /* Hash map for string constants.  */
+  hash_map <tree, hsa_symbol *> m_string_constants_map;
+
+  /* Vector of pointers to spill symbols.  */
+  vec <struct hsa_symbol *> m_spill_symbols;
+
+  /* Vector of pointers to symbols (string constants and global,
+     non-addressable variables with a constructor).  */
+  vec <struct hsa_symbol *> m_readonly_variables;
+
+  /* Private function artificial variables.  */
+  vec <struct hsa_symbol *> m_private_variables;
+
+  /* Vector of called function declarations.  */
+  vec <tree> m_called_functions;
+
+  /* Number of HBB BBs.  */
+  int m_hbb_count;
+
+  /* Whether or not we could check and enforce SSA properties.  */
+  bool m_in_ssa;
+
+  /* True if the function is kernel function.  */
+  bool m_kern_p;
+
+  /* True if the function representation is a declaration.  */
+  bool m_declaration_p;
+
+  /* Function declaration tree.  */
+  tree m_decl;
+
+  /* Runtime shadow register.  */
+  hsa_op_reg *m_shadow_reg;
+
+  /* Number of kernel dispatches which take place in the function.  */
+  unsigned m_kernel_dispatch_count;
+
+  /* If the function representation contains a kernel dispatch, OMP data size
+     is the amount of memory that needs to be copied before the kernel
+     dispatch.  */
+  unsigned m_maximum_omp_data_size;
+
+  /* True if an HSA-specific warning has already been seen.  */
+  bool m_seen_error;
+
+  /* Counter for temporary symbols created in the function representation.  */
+  unsigned m_temp_symbol_count;
+
+  /* SSA names mapping.  */
+  vec <hsa_op_reg_p> m_ssa_map;
+};
+
+enum hsa_function_kind
+{
+  HSA_NONE,
+  HSA_KERNEL,
+  HSA_FUNCTION
+};
+
+struct hsa_function_summary
+{
+  /* Default constructor.  */
+  hsa_function_summary ();
+
+  /* Kind of GPU/host function.  */
+  hsa_function_kind m_kind;
+
+  /* Pointer to the cgraph node which is the HSA implementation of the
+     function.  In case the function is an HSA function, the bound function
+     points to the host function.  */
+  cgraph_node *m_binded_function;
+
+  /* True if the function is a GPU/HSA implementation rather than a host
+     function.  */
+  bool m_gpu_implementation_p;
+};
+
+inline
+hsa_function_summary::hsa_function_summary (): m_kind (HSA_NONE),
+  m_binded_function (NULL), m_gpu_implementation_p (false)
+{
+}
+
+/* Function summary for HSA functions.  */
+class hsa_summary_t: public function_summary <hsa_function_summary *>
+{
+public:
+  hsa_summary_t (symbol_table *table):
+    function_summary<hsa_function_summary *> (table) { }
+
+  void link_functions (cgraph_node *gpu, cgraph_node *host,
+		       hsa_function_kind kind);
+};
+
+inline void
+hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
+			       hsa_function_kind kind)
+{
+  hsa_function_summary *gpu_summary = get (gpu);
+  hsa_function_summary *host_summary = get (host);
+
+  gpu_summary->m_kind = kind;
+  host_summary->m_kind = kind;
+
+  gpu_summary->m_gpu_implementation_p = true;
+  host_summary->m_gpu_implementation_p = false;
+
+  gpu_summary->m_binded_function = host;
+  host_summary->m_binded_function = gpu;
+}
+
+/* in hsa.c */
+extern struct hsa_function_representation *hsa_cfun;
+extern hash_map <tree, vec <const char *> *> *hsa_decl_kernel_dependencies;
+extern hsa_summary_t *hsa_summaries;
+extern hsa_symbol *hsa_num_threads;
+extern unsigned hsa_kernel_calls_counter;
+extern hash_set <tree> *hsa_failed_functions;
+
+bool hsa_callable_function_p (tree fndecl);
+void hsa_init_compilation_unit_data (void);
+void hsa_deinit_compilation_unit_data (void);
+bool hsa_machine_large_p (void);
+bool hsa_full_profile_p (void);
+bool hsa_opcode_floating_bit_insn_p (BrigOpcode16_t);
+unsigned hsa_type_bit_size (BrigType16_t t);
+BrigType16_t hsa_bittype_for_bitsize (unsigned bitsize);
+BrigType16_t hsa_uint_for_bitsize (unsigned bitsize);
+BrigType16_t hsa_bittype_for_type (BrigType16_t t);
+bool hsa_type_float_p (BrigType16_t type);
+bool hsa_type_integer_p (BrigType16_t type);
+bool hsa_btype_p (BrigType16_t type);
+BrigAlignment8_t hsa_alignment_encoding (unsigned n);
+BrigAlignment8_t hsa_natural_alignment (BrigType16_t type);
+void hsa_destroy_operand (hsa_op_base *op);
+void hsa_destroy_insn (hsa_insn_basic *insn);
+void hsa_add_kern_decl_mapping (tree decl, char *name, unsigned);
+unsigned hsa_get_number_decl_kernel_mappings (void);
+tree hsa_get_decl_kernel_mapping_decl (unsigned i);
+char *hsa_get_decl_kernel_mapping_name (unsigned i);
+unsigned hsa_get_decl_kernel_mapping_omp_size (unsigned i);
+void hsa_free_decl_kernel_mapping (void);
+void hsa_add_kernel_dependency (tree caller, const char *called_function);
+void hsa_sanitize_name (char *p);
+char *hsa_brig_function_name (const char *p);
+const char *hsa_get_declaration_name (tree decl);
+void hsa_register_kernel (cgraph_node *host);
+void hsa_register_kernel (cgraph_node *gpu, cgraph_node *host);
+bool hsa_seen_error (void);
+void hsa_fail_cfun (void);
+
+/* In hsa-gen.c.  */
+void hsa_build_append_simple_mov (hsa_op_reg *, hsa_op_base *, hsa_bb *);
+hsa_symbol *hsa_get_spill_symbol (BrigType16_t);
+hsa_symbol *hsa_get_string_cst_symbol (BrigType16_t);
+hsa_op_reg *hsa_spill_in (hsa_insn_basic *, hsa_op_reg *, hsa_op_reg **);
+hsa_op_reg *hsa_spill_out (hsa_insn_basic *, hsa_op_reg *, hsa_op_reg **);
+hsa_bb *hsa_init_new_bb (basic_block);
+hsa_function_representation *hsa_generate_function_declaration (tree decl);
+tree hsa_get_host_function (tree decl);
+
+/* In hsa-regalloc.c.  */
+void hsa_regalloc (void);
+
+/* In hsa-brig.c.  */
+void hsa_brig_emit_function (void);
+void hsa_output_brig (void);
+BrigType16_t bittype_for_type (BrigType16_t t);
+unsigned hsa_get_imm_brig_type_len (BrigType16_t type);
+void hsa_brig_emit_omp_symbols (void);
+
+/* In hsa-dump.c.  */
+const char *hsa_seg_name (BrigSegment8_t);
+void dump_hsa_insn (FILE *f, hsa_insn_basic *insn);
+void dump_hsa_bb (FILE *, hsa_bb *);
+void dump_hsa_cfun (FILE *);
+DEBUG_FUNCTION void debug_hsa_operand (hsa_op_base *opc);
+DEBUG_FUNCTION void debug_hsa_insn (hsa_insn_basic *insn);
+
+union hsa_bytes
+{
+  uint8_t b8;
+  uint16_t b16;
+  uint32_t b32;
+  uint64_t b64;
+};
+
+/* Return true if a function DECL is an HSA implementation.  */
+
+static inline bool
+hsa_gpu_implementation_p (tree decl)
+{
+  if (hsa_summaries == NULL)
+    return false;
+
+  hsa_function_summary *s = hsa_summaries->get (cgraph_node::get_create (decl));
+
+  return s->m_gpu_implementation_p;
+}
+
+#endif /* HSA_H */
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 140e36f..68eada9 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -75,6 +75,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gcse.h"
 #include "tree-chkp.h"
 #include "omp-low.h"
+#include "hsa.h"
 
 #if defined(DBX_DEBUGGING_INFO) || defined(XCOFF_DEBUGGING_INFO)
 #include "dbxout.h"
@@ -520,6 +521,8 @@ compile_file (void)
 
       omp_finish_file ();
 
+      hsa_output_brig ();
+
       output_shared_constant_pool ();
       output_object_blocks ();
       finish_tm_clone_pairs ();

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [hsa 12/12] HSA register allocator
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (10 preceding siblings ...)
  2015-11-05 22:06 ` [hsa 11/12] Majority of the HSA back-end Martin Jambor
@ 2015-11-05 22:07 ` Martin Jambor
  2015-11-06 10:13 ` Merge of HSA branch Bernd Schmidt
  12 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-05 22:07 UTC (permalink / raw)
  To: GCC Patches; +Cc: Michael Matz

Hi,

because the HSA back-end is not based on RTL, we need our own register
allocator, and it is in this patch.  The allocator has been written by
Michael Matz and I have put it into a separate email so that I can add
him to CC, because he is much better suited to answer any questions or
review comments.
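
In case it helps review: the core of the allocator is the textbook
linear-scan discipline (sort live intervals by increasing start point,
keep an active list, and when out of registers steal from or spill the
interval with the furthest end point).  The following stand-alone
sketch uses illustrative names only and leaves out all the GCC and HSA
specifics (register classes, spill symbols, the liveness computation);
it merely shows the shape of what linear_scan_regalloc in the patch
below does:

  #include <algorithm>
  #include <vector>

  /* A live interval; hardreg/spilled describe the allocation result.  */
  struct interval { int begin, end; int hardreg = -1; bool spilled = true; };

  /* Allocate NREGS hardregs (assumed >= 1) to IVS by linear scan.  */
  void
  linear_scan (std::vector<interval *> &ivs, int nregs)
  {
    std::sort (ivs.begin (), ivs.end (),
               [] (interval *a, interval *b) { return a->begin < b->begin; });
    std::vector<interval *> active;  /* Sorted by decreasing end point.  */
    std::vector<int> free_regs;
    for (int i = 0; i < nregs; i++)
      free_regs.push_back (i);

    for (interval *iv : ivs)
      {
        /* Expire intervals that end before IV starts.  */
        while (!active.empty () && active.back ()->end <= iv->begin)
          {
            free_regs.push_back (active.back ()->hardreg);
            active.pop_back ();
          }
        if (!free_regs.empty ())
          {
            iv->hardreg = free_regs.back ();
            free_regs.pop_back ();
            iv->spilled = false;
          }
        else
          {
            /* Out of registers: steal from the active interval with the
               furthest end point if it ends after IV, else spill IV.  */
            interval *cand = active.front ();
            if (cand->end > iv->end)
              {
                iv->hardreg = cand->hardreg;
                iv->spilled = false;
                cand->spilled = true;
                active.erase (active.begin ());
              }
          }
        if (!iv->spilled)
          active.insert (std::upper_bound (active.begin (), active.end (), iv,
                                           [] (interval *a, interval *b)
                                           { return a->end > b->end; }),
                         iv);
      }
  }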

Thanks,

Martin


2015-11-05  Michael Matz  <matz@suse.de>
	    Martin Jambor  <mjambor@suse.cz>

	* hsa-regalloc.c: New file.


diff --git a/gcc/hsa-regalloc.c b/gcc/hsa-regalloc.c
new file mode 100644
index 0000000..3919258
--- /dev/null
+++ b/gcc/hsa-regalloc.c
@@ -0,0 +1,711 @@
+/* HSAIL IL Register allocation and out-of-SSA.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Michael Matz <matz@suse.de>
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "vec.h"
+#include "tree.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "cfganal.h"
+#include "function.h"
+#include "bitmap.h"
+#include "dumpfile.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "cfghooks.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+
+/* Process the PHI node PHI as a part of naive out-of-SSA.  */
+
+static void
+naive_process_phi (hsa_insn_phi *phi)
+{
+  unsigned count = phi->operand_count ();
+  for (unsigned i = 0; i < count; i++)
+    {
+      gcc_checking_assert (phi->get_op (i));
+      hsa_op_base *op = phi->get_op (i);
+      hsa_bb *hbb;
+      edge e;
+
+      if (!op)
+	break;
+
+      e = EDGE_PRED (phi->m_bb, i);
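+      /* Append the copy to the predecessor directly if this is its only
+         outgoing edge; otherwise the edge is critical and has to be split
+         so the copy does not execute on other paths.  */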
+      if (single_succ_p (e->src))
+	hbb = hsa_bb_for_bb (e->src);
+      else
+	{
+	  basic_block old_dest = e->dest;
+	  hbb = hsa_init_new_bb (split_edge (e));
+
+	  /* If switch insn used this edge, fix jump table.  */
+	  hsa_bb *source = hsa_bb_for_bb (e->src);
+	  hsa_insn_sbr *sbr;
+	  if (source->m_last_insn
+	      && (sbr = dyn_cast <hsa_insn_sbr *> (source->m_last_insn)))
+	    sbr->replace_all_labels (old_dest, hbb->m_bb);
+	}
+
+      hsa_build_append_simple_mov (phi->m_dest, op, hbb);
+    }
+}
+
+/* Naive out-of-SSA.  */
+
+static void
+naive_outof_ssa (void)
+{
+  basic_block bb;
+
+  hsa_cfun->m_in_ssa = false;
+
+  FOR_ALL_BB_FN (bb, cfun)
+  {
+    hsa_bb *hbb = hsa_bb_for_bb (bb);
+    hsa_insn_phi *phi;
+
+    for (phi = hbb->m_first_phi;
+	 phi;
+	 phi = phi->m_next ? as_a <hsa_insn_phi *> (phi->m_next): NULL)
+      naive_process_phi (phi);
+
+    /* Zap PHI nodes, they will be deallocated when everything else will.  */
+    hbb->m_first_phi = NULL;
+    hbb->m_last_phi = NULL;
+  }
+}
+
+/* Return register class number for the given HSA TYPE.  0 means the 'c' one
+   bit register class, 1 means 's' 32 bit class, 2 stands for 'd' 64 bit class
+   and 3 for 'q' 128 bit class.  */
+
+static int
+m_reg_class_for_type (BrigType16_t type)
+{
+  switch (type)
+    {
+    case BRIG_TYPE_B1:
+      return 0;
+
+    case BRIG_TYPE_U8:
+    case BRIG_TYPE_U16:
+    case BRIG_TYPE_U32:
+    case BRIG_TYPE_S8:
+    case BRIG_TYPE_S16:
+    case BRIG_TYPE_S32:
+    case BRIG_TYPE_F16:
+    case BRIG_TYPE_F32:
+    case BRIG_TYPE_B8:
+    case BRIG_TYPE_B16:
+    case BRIG_TYPE_B32:
+    case BRIG_TYPE_U8X4:
+    case BRIG_TYPE_S8X4:
+    case BRIG_TYPE_U16X2:
+    case BRIG_TYPE_S16X2:
+    case BRIG_TYPE_F16X2:
+      return 1;
+
+    case BRIG_TYPE_U64:
+    case BRIG_TYPE_S64:
+    case BRIG_TYPE_F64:
+    case BRIG_TYPE_B64:
+    case BRIG_TYPE_U8X8:
+    case BRIG_TYPE_S8X8:
+    case BRIG_TYPE_U16X4:
+    case BRIG_TYPE_S16X4:
+    case BRIG_TYPE_F16X4:
+    case BRIG_TYPE_U32X2:
+    case BRIG_TYPE_S32X2:
+    case BRIG_TYPE_F32X2:
+      return 2;
+
+    case BRIG_TYPE_B128:
+    case BRIG_TYPE_U8X16:
+    case BRIG_TYPE_S8X16:
+    case BRIG_TYPE_U16X8:
+    case BRIG_TYPE_S16X8:
+    case BRIG_TYPE_F16X8:
+    case BRIG_TYPE_U32X4:
+    case BRIG_TYPE_U64X2:
+    case BRIG_TYPE_S32X4:
+    case BRIG_TYPE_S64X2:
+    case BRIG_TYPE_F32X4:
+    case BRIG_TYPE_F64X2:
+      return 3;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* If the Ith operand of INSN is or contains a register (in an address),
+   return the address of that register operand.  If not, return NULL.  */
+
+static hsa_op_reg **
+insn_reg_addr (hsa_insn_basic *insn, int i)
+{
+  hsa_op_base *op = insn->get_op (i);
+  if (!op)
+    return NULL;
+  hsa_op_reg *reg = dyn_cast <hsa_op_reg *> (op);
+  if (reg)
+    return (hsa_op_reg **) insn->get_op_addr (i);
+  hsa_op_address *addr = dyn_cast <hsa_op_address *> (op);
+  if (addr && addr->m_reg)
+    return &addr->m_reg;
+  return NULL;
+}
+
+struct m_reg_class_desc
+{
+  unsigned next_avail, max_num;
+  unsigned used_num, max_used;
+  uint64_t used[2];
+  char cl_char;
+};
+
+/* Rewrite the instructions in BB to observe spilled live ranges.
+   CLASSES is the global register class state.  */
+
+static void
+rewrite_code_bb (basic_block bb, struct m_reg_class_desc *classes)
+{
+  hsa_bb *hbb = hsa_bb_for_bb (bb);
+  hsa_insn_basic *insn, *next_insn;
+
+  for (insn = hbb->m_first_insn; insn; insn = next_insn)
+    {
+      next_insn = insn->m_next;
+      unsigned count = insn->operand_count ();
+      for (unsigned i = 0; i < count; i++)
+	{
+	  gcc_checking_assert (insn->get_op (i));
+	  hsa_op_reg **regaddr = insn_reg_addr (insn, i);
+
+	  if (regaddr)
+	    {
+	      hsa_op_reg *reg = *regaddr;
+	      if (reg->m_reg_class)
+		continue;
+	      gcc_assert (reg->m_spill_sym);
+
+	      int cl = m_reg_class_for_type (reg->m_type);
+	      hsa_op_reg *tmp, *tmp2;
+	      if (insn->op_output_p (i))
+		tmp = hsa_spill_out (insn, reg, &tmp2);
+	      else
+		tmp = hsa_spill_in (insn, reg, &tmp2);
+
+	      *regaddr = tmp;
+
+	      tmp->m_reg_class = classes[cl].cl_char;
+	      tmp->m_hard_num = (char) (classes[cl].max_num + i);
+	      if (tmp2)
+		{
+		  gcc_assert (cl == 0);
+		  tmp2->m_reg_class = classes[1].cl_char;
+		  tmp2->m_hard_num = (char) (classes[1].max_num + i);
+		}
+	    }
+	}
+    }
+}
+
+/* Dump current function to dump file F, with info specific
+   to register allocation.  */
+
+void
+dump_hsa_cfun_regalloc (FILE *f)
+{
+  basic_block bb;
+
+  fprintf (f, "\nHSAIL IL for %s\n", hsa_cfun->m_name);
+
+  FOR_ALL_BB_FN (bb, cfun)
+  {
+    hsa_bb *hbb = (struct hsa_bb *) bb->aux;
+    bitmap_print (f, hbb->m_livein, "m_livein  ", "\n");
+    dump_hsa_bb (f, hbb);
+    bitmap_print (f, hbb->m_liveout, "m_liveout ", "\n");
+  }
+}
+
+/* Given the global register allocation state CLASSES and a
+   register REG, try to give it a hardware register.  If successful,
+   store that hardreg in REG and return it, otherwise return -1.
+   Also change CLASSES to account for the allocated register.  */
+
+static int
+try_alloc_reg (struct m_reg_class_desc *classes, hsa_op_reg *reg)
+{
+  int cl = m_reg_class_for_type (reg->m_type);
+  int ret = -1;
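+  /* HSA mandates that the numbers of 's', 'd' and 'q' registers in use
+     satisfy s + 2 * d + 4 * q <= 128; the five slots subtracted below are
+     kept in reserve for the spill temporaries that rewrite_code_bb creates
+     (an inference from the code, not a documented constant).  */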
+  if (classes[1].used_num + classes[2].used_num * 2 + classes[3].used_num * 4
+      >= 128 - 5)
+    return -1;
+  if (classes[cl].used_num < classes[cl].max_num)
+    {
+      unsigned int i;
+      classes[cl].used_num++;
+      if (classes[cl].used_num > classes[cl].max_used)
+	classes[cl].max_used = classes[cl].used_num;
+      for (i = 0; i < classes[cl].used_num; i++)
+	if (! (classes[cl].used[i / 64] & (((uint64_t)1) << (i & 63))))
+	  break;
+      ret = i;
+      classes[cl].used[i / 64] |= (((uint64_t)1) << (i & 63));
+      reg->m_reg_class = classes[cl].cl_char;
+      reg->m_hard_num = i;
+    }
+  return ret;
+}
+
+/* Free up hardregs used by REG, into allocation state CLASSES.  */
+
+static void
+free_reg (struct m_reg_class_desc *classes, hsa_op_reg *reg)
+{
+  int cl = m_reg_class_for_type (reg->m_type);
+  int ret = reg->m_hard_num;
+  gcc_assert (reg->m_reg_class == classes[cl].cl_char);
+  classes[cl].used_num--;
+  classes[cl].used[ret / 64] &= ~(((uint64_t)1) << (ret & 63));
+}
+
+/* Note that the live range for REG ends at least at END.  */
+
+static void
+note_lr_end (hsa_op_reg *reg, int end)
+{
+  if (reg->m_lr_end < end)
+    reg->m_lr_end = end;
+}
+
+/* Note that the live range for REG starts at least at BEGIN.  */
+
+static void
+note_lr_begin (hsa_op_reg *reg, int begin)
+{
+  if (reg->m_lr_begin > begin)
+    reg->m_lr_begin = begin;
+}
+
+/* Given two registers A and B, return -1, 0 or 1 if A's live range
+   starts before, at or after B's live range.  */
+
+static int
+cmp_begin (const void *a, const void *b)
+{
+  const hsa_op_reg * const *rega = (const hsa_op_reg * const *)a;
+  const hsa_op_reg * const *regb = (const hsa_op_reg * const *)b;
+  int ret;
+  if (rega == regb)
+    return 0;
+  ret = (*rega)->m_lr_begin - (*regb)->m_lr_begin;
+  if (ret)
+    return ret;
+  return ((*rega)->m_order - (*regb)->m_order);
+}
+
+/* Given two registers REGA and REGB, return true if REGA's
+   live range ends after REGB's.  This results in a sorting order
+   with earlier end points at the end.  */
+
+static bool
+cmp_end (hsa_op_reg * const &rega, hsa_op_reg * const &regb)
+{
+  int ret;
+  if (rega == regb)
+    return false;
+  ret = (regb)->m_lr_end - (rega)->m_lr_end;
+  if (ret)
+    return ret < 0;
+  return (((regb)->m_order - (rega)->m_order)) < 0;
+}
+
+/* Expire all old intervals in ACTIVE (a per-regclass vector),
+   that is, those that end before the interval REG starts.  Give
+   back resources freed so into the state CLASSES.  */
+
+static void
+expire_old_intervals (hsa_op_reg *reg, vec<hsa_op_reg*> *active,
+		      struct m_reg_class_desc *classes)
+{
+  for (int i = 0; i < 4; i++)
+    while (!active[i].is_empty ())
+      {
+	hsa_op_reg *a = active[i].pop ();
+	if (a->m_lr_end > reg->m_lr_begin)
+	  {
+	    active[i].quick_push (a);
+	    break;
+	  }
+	free_reg (classes, a);
+      }
+}
+
+/* The interval REG didn't get a hardreg.  Spill it or one of those
+   from ACTIVE (if the latter, then REG will become allocated to the
+   hardreg that formerly was used by it).  */
+
+static void
+spill_at_interval (hsa_op_reg *reg, vec<hsa_op_reg*> *active)
+{
+  int cl = m_reg_class_for_type (reg->m_type);
+  gcc_assert (!active[cl].is_empty ());
+  hsa_op_reg *cand = active[cl][0];
+  if (cand->m_lr_end > reg->m_lr_end)
+    {
+      reg->m_reg_class = cand->m_reg_class;
+      reg->m_hard_num = cand->m_hard_num;
+      active[cl].ordered_remove (0);
+      unsigned place = active[cl].lower_bound (reg, cmp_end);
+      active[cl].quick_insert (place, reg);
+    }
+  else
+    cand = reg;
+
+  gcc_assert (!cand->m_spill_sym);
+  BrigType16_t type = cand->m_type;
+  if (type == BRIG_TYPE_B1)
+    type = BRIG_TYPE_U8;
+  cand->m_reg_class = 0;
+  cand->m_spill_sym = hsa_get_spill_symbol (type);
+  cand->m_spill_sym->m_name_number = cand->m_order;
+}
+
+/* Given the global register state CLASSES allocate all HSA virtual
+   registers either to hardregs or to a spill symbol.  */
+
+static void
+linear_scan_regalloc (struct m_reg_class_desc *classes)
+{
+  /* Compute liveness.  */
+  bool changed;
+  int i, n;
+  int insn_order;
+  int *bbs = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
+  bitmap work = BITMAP_ALLOC (NULL);
+  vec<hsa_op_reg*> ind2reg = vNULL;
+  vec<hsa_op_reg*> active[4] = {vNULL, vNULL, vNULL, vNULL};
+  hsa_insn_basic *m_last_insn;
+
+  /* We will need the reverse post order for linearization,
+     and the post order for liveness analysis, which is the same
+     backward.  */
+  n = pre_and_rev_post_order_compute (NULL, bbs, true);
+  ind2reg.safe_grow_cleared (hsa_cfun->m_reg_count);
+
+  /* Give all instructions a linearized number, at the same time
+     build a mapping from register index to register.  */
+  insn_order = 1;
+  for (i = 0; i < n; i++)
+    {
+      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, bbs[i]);
+      hsa_bb *hbb = hsa_bb_for_bb (bb);
+      hsa_insn_basic *insn;
+      for (insn = hbb->m_first_insn; insn; insn = insn->m_next)
+	{
+	  unsigned opi;
+	  insn->m_number = insn_order++;
+	  for (opi = 0; opi < insn->operand_count (); opi++)
+	    {
+	      gcc_checking_assert (insn->get_op (opi));
+	      hsa_op_reg **regaddr = insn_reg_addr (insn, opi);
+	      if (regaddr)
+		ind2reg[(*regaddr)->m_order] = *regaddr;
+	    }
+	}
+    }
+
+  /* Initialize all live ranges to [after-end, 0).  */
+  for (i = 0; i < hsa_cfun->m_reg_count; i++)
+    if (ind2reg[i])
+      ind2reg[i]->m_lr_begin = insn_order, ind2reg[i]->m_lr_end = 0;
+
+  /* Classic liveness analysis, as long as something changes:
+       m_liveout is union (m_livein of successors)
+       m_livein is m_liveout minus defs plus uses.  */
+  do
+    {
+      changed = false;
+      for (i = n - 1; i >= 0; i--)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, bbs[i]);
+	  hsa_bb *hbb = hsa_bb_for_bb (bb);
+
+	  /* Union of successors m_livein (or empty if none).  */
+	  bool first = true;
+	  FOR_EACH_EDGE (e, ei, bb->succs)
+	    if (e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	      {
+		hsa_bb *succ = hsa_bb_for_bb (e->dest);
+		if (first)
+		  {
+		    bitmap_copy (work, succ->m_livein);
+		    first = false;
+		  }
+		else
+		  bitmap_ior_into (work, succ->m_livein);
+	      }
+	  if (first)
+	    bitmap_clear (work);
+
+	  bitmap_copy (hbb->m_liveout, work);
+
+	  /* Remove defs, include uses in a backward insn walk.  */
+	  hsa_insn_basic *insn;
+	  for (insn = hbb->m_last_insn; insn; insn = insn->m_prev)
+	    {
+	      unsigned opi;
+	      unsigned ndefs = insn->input_count ();
+	      for (opi = 0; opi < ndefs && insn->get_op (opi); opi++)
+		{
+		  gcc_checking_assert (insn->get_op (opi));
+		  hsa_op_reg **regaddr = insn_reg_addr (insn, opi);
+		  if (regaddr)
+		    bitmap_clear_bit (work, (*regaddr)->m_order);
+		}
+	      for (; opi < insn->operand_count (); opi++)
+		{
+		  gcc_checking_assert (insn->get_op (opi));
+		  hsa_op_reg **regaddr = insn_reg_addr (insn, opi);
+		  if (regaddr)
+		    bitmap_set_bit (work, (*regaddr)->m_order);
+		}
+	    }
+
+	  /* Note if that changed something.  */
+	  if (bitmap_ior_into (hbb->m_livein, work))
+	    changed = true;
+	}
+    }
+  while (changed);
+
+  /* Make one pass through all instructions in linear order,
+     noting and merging possible live range start and end points.  */
+  m_last_insn = NULL;
+  for (i = n - 1; i >= 0; i--)
+    {
+      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, bbs[i]);
+      hsa_bb *hbb = hsa_bb_for_bb (bb);
+      hsa_insn_basic *insn;
+      int after_end_number;
+      unsigned bit;
+      bitmap_iterator bi;
+
+      if (m_last_insn)
+	after_end_number = m_last_insn->m_number;
+      else
+	after_end_number = insn_order;
+      /* Everything live-out in this BB has at least an end point
+         after us.  */
+      EXECUTE_IF_SET_IN_BITMAP (hbb->m_liveout, 0, bit, bi)
+	note_lr_end (ind2reg[bit], after_end_number);
+
+      for (insn = hbb->m_last_insn; insn; insn = insn->m_prev)
+	{
+	  unsigned opi;
+	  unsigned ndefs = insn->input_count ();
+	  for (opi = 0; opi < insn->operand_count (); opi++)
+	    {
+	      gcc_checking_assert (insn->get_op (opi));
+	      hsa_op_reg **regaddr = insn_reg_addr (insn, opi);
+	      if (regaddr)
+		{
+		  hsa_op_reg *reg = *regaddr;
+		  if (opi < ndefs)
+		    note_lr_begin (reg, insn->m_number);
+		  else
+		    note_lr_end (reg, insn->m_number);
+		}
+	    }
+	}
+
+      /* Everything live-in in this BB has a start point before
+         our first insn.  */
+      int before_start_number;
+      if (hbb->m_first_insn)
+	before_start_number = hbb->m_first_insn->m_number;
+      else
+	before_start_number = after_end_number;
+      before_start_number--;
+      EXECUTE_IF_SET_IN_BITMAP (hbb->m_livein, 0, bit, bi)
+	note_lr_begin (ind2reg[bit], before_start_number);
+
+      if (hbb->m_first_insn)
+	m_last_insn = hbb->m_first_insn;
+    }
+
+  /* All regs whose live range still starts after all code are actually
+     defined at the start of the routine (prologue).  */
+  for (i = 0; i < hsa_cfun->m_reg_count; i++)
+    if (ind2reg[i] && ind2reg[i]->m_lr_begin == insn_order)
+      ind2reg[i]->m_lr_begin = 0;
+
+  /* Sort all intervals by increasing start point.  */
+  gcc_assert (ind2reg.length () == (size_t) hsa_cfun->m_reg_count);
+
+#ifdef ENABLE_CHECKING
+  for (unsigned i = 0; i < ind2reg.length (); i++)
+    gcc_assert (ind2reg[i]);
+#endif
+
+  ind2reg.qsort (cmp_begin);
+  for (i = 0; i < 4; i++)
+    active[i].reserve_exact (hsa_cfun->m_reg_count);
+
+  /* Now comes the linear scan allocation.  */
+  for (i = 0; i < hsa_cfun->m_reg_count; i++)
+    {
+      hsa_op_reg *reg = ind2reg[i];
+      if (!reg)
+	continue;
+      expire_old_intervals (reg, active, classes);
+      int cl = m_reg_class_for_type (reg->m_type);
+      if (try_alloc_reg (classes, reg) >= 0)
+	{
+	  unsigned place = active[cl].lower_bound (reg, cmp_end);
+	  active[cl].quick_insert (place, reg);
+	}
+      else
+	spill_at_interval (reg, active);
+
+      /* Some interesting dumping as we go.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, "  reg%d: [%5d, %5d)->",
+		   reg->m_order, reg->m_lr_begin, reg->m_lr_end);
+	  if (reg->m_reg_class)
+	    fprintf (dump_file, "$%c%i", reg->m_reg_class, reg->m_hard_num);
+	  else
+	    fprintf (dump_file, "[%%__%s_%i]",
+		     hsa_seg_name (reg->m_spill_sym->m_segment),
+		     reg->m_spill_sym->m_name_number);
+	  for (int cl = 0; cl < 4; cl++)
+	    {
+	      bool first = true;
+	      hsa_op_reg *r;
+	      fprintf (dump_file, " {");
+	      for (int j = 0; active[cl].iterate (j, &r); j++)
+		if (first)
+		  {
+		    fprintf (dump_file, "%d", r->m_order);
+		    first = false;
+		  }
+		else
+		  fprintf (dump_file, ", %d", r->m_order);
+	      fprintf (dump_file, "}");
+	    }
+	  fprintf (dump_file, "\n");
+	}
+    }
+
+  BITMAP_FREE (work);
+  free (bbs);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "------- After liveness: -------\n");
+      dump_hsa_cfun_regalloc (dump_file);
+      fprintf (dump_file, "  ----- Intervals:\n");
+      for (i = 0; i < hsa_cfun->m_reg_count; i++)
+	{
+	  hsa_op_reg *reg = ind2reg[i];
+	  if (!reg)
+	    continue;
+	  fprintf (dump_file, "  reg%d: [%5d, %5d)->", reg->m_order,
+		   reg->m_lr_begin, reg->m_lr_end);
+	  if (reg->m_reg_class)
+	    fprintf (dump_file, "$%c%i\n", reg->m_reg_class, reg->m_hard_num);
+	  else
+	    fprintf (dump_file, "[%%__%s_%i]\n",
+		     hsa_seg_name (reg->m_spill_sym->m_segment),
+		     reg->m_spill_sym->m_name_number);
+	}
+    }
+
+  for (i = 0; i < 4; i++)
+    active[i].release ();
+  ind2reg.release ();
+}
+
+/* Entry point for register allocation.  */
+
+static void
+regalloc (void)
+{
+  basic_block bb;
+  m_reg_class_desc classes[4];
+
+  /* If there are no registers used in the function, exit right away.  */
+  if (hsa_cfun->m_reg_count == 0)
+    return;
+
+  memset (classes, 0, sizeof (classes));
+  classes[0].next_avail = 0;
+  classes[0].max_num = 7;
+  classes[0].cl_char = 'c';
+  classes[1].cl_char = 's';
+  classes[2].cl_char = 'd';
+  classes[3].cl_char = 'q';
+
+  for (int i = 1; i < 4; i++)
+    {
+      classes[i].next_avail = 0;
+      classes[i].max_num = 20;
+    }
+
+  linear_scan_regalloc (classes);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    rewrite_code_bb (bb, classes);
+}
+
+/* Out of SSA and register allocation on HSAIL IL.  */
+
+void
+hsa_regalloc (void)
+{
+  naive_outof_ssa ();
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "------- After out-of-SSA: -------\n");
+      dump_hsa_cfun (dump_file);
+    }
+
+  regalloc ();
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "------- After register allocation: -------\n");
+      dump_hsa_cfun (dump_file);
+    }
+}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 1/12] Configuration and offloading-related changes
  2015-11-05 21:53 ` [hsa 1/12] Configuration and offloading-related changes Martin Jambor
@ 2015-11-05 22:47   ` Joseph Myers
  2015-11-09 16:57     ` Martin Jambor
  0 siblings, 1 reply; 44+ messages in thread
From: Joseph Myers @ 2015-11-05 22:47 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Jakub Jelinek

On Thu, 5 Nov 2015, Martin Jambor wrote:

> libgomp plugin to be built.  Because the plugin needs to use HSA
> run-time library, I have introduced options --with-hsa-runtime (and
> more precise --with-hsa-include and --with-hsa-lib) to help find it.

New configure options should be documented in install.texi.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 3/12] HSA libgomp plugin
  2015-11-05 21:56 ` [hsa 3/12] HSA libgomp plugin Martin Jambor
@ 2015-11-05 22:47   ` Joseph Myers
  2015-11-09 16:58     ` Martin Jambor
  0 siblings, 1 reply; 44+ messages in thread
From: Joseph Myers @ 2015-11-05 22:47 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Jakub Jelinek

This new file should have the standard libgomp copyright / license notice.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-05 21:58 ` [hsa 5/12] New HSA-related GCC options Martin Jambor
@ 2015-11-05 22:48   ` Joseph Myers
  2015-11-06  8:42   ` Richard Biener
  1 sibling, 0 replies; 44+ messages in thread
From: Joseph Myers @ 2015-11-05 22:48 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Richard Biener

The new options need to be documented in invoke.texi.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 7/12] Disabling the vectorizer for GPU kernels/functions
  2015-11-05 22:01 ` [hsa 7/12] Disabling the vectorizer for GPU kernels/functions Martin Jambor
@ 2015-11-06  8:38   ` Richard Biener
  2015-11-10 14:48     ` Martin Jambor
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Biener @ 2015-11-06  8:38 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches

On Thu, 5 Nov 2015, Martin Jambor wrote:

> Hi,
> 
> in the previous email I wrote we need to "change behavior" of a few
> optimization passes.  One was the flattening of GPU functions and the
> other two are in the patch below.  It all comes down to the fact that,
> at the moment, we need to switch off the vectorizer (only for the GPU
> functions, of course).
> 
> We are actually quite close to being able to handle gimple vector
> input in the HSA back-end but not all the way yet, and before allowing
> the vectorizer again, we will have to make sure it never produces
> vectors bigger than 128 bits (in GPU functions).

Hmm.  I'd rather have this modify
DECL_FUNCTION_SPECIFIC_OPTIMIZATION of the hsa function to get this
effect.  I think I mentioned this to the OACC guys as well for a
similar need of theirs.
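
Something like this untested sketch (hsa_disable_vectorization is a
made-up name; it assumes the clone's fndecl is at hand at creation time
and that global_options is the right base to copy):

  /* Give the HSA clone FNDECL its own optimization node with the
     vectorizers disabled, instead of gating the individual passes.  */
  static void
  hsa_disable_vectorization (tree fndecl)
  {
    struct gcc_options opts = global_options;
    opts.x_flag_tree_loop_vectorize = 0;
    opts.x_flag_tree_slp_vectorize = 0;
    DECL_FUNCTION_SPECIFIC_OPTIMIZATION (fndecl)
      = build_optimization_node (&opts);
  }

That way the pass gates stay as they are and the HSA-specific knowledge
lives in one place.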

Richard.

> Thanks,
> 
> Martin
> 
> 
> 2015-11-05  Martin Jambor  <mjambor@suse.cz>
> 
> 	* tree-ssa-loop.c: Include cgraph.h, symbol-summary.h and hsa.h.
> 	(pass_vectorize::gate): Do not run on HSA functions.
> 	* tree-vectorizer.c: Include symbol-summary.h and hsa.h.
> 	(pass_slp_vectorize::gate): Do not run on HSA functions.
> 
> diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
> index 8ecd140..0d119e2 100644
> --- a/gcc/tree-ssa-loop.c
> +++ b/gcc/tree-ssa-loop.c
> @@ -35,6 +35,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-inline.h"
>  #include "tree-scalar-evolution.h"
>  #include "tree-vectorizer.h"
> +#include "cgraph.h"
> +#include "symbol-summary.h"
> +#include "hsa.h"
>  
>  
>  /* A pass making sure loops are fixed up.  */
> @@ -257,7 +260,8 @@ public:
>    /* opt_pass methods: */
>    virtual bool gate (function *fun)
>      {
> -      return flag_tree_loop_vectorize || fun->has_force_vectorize_loops;
> +      return (flag_tree_loop_vectorize || fun->has_force_vectorize_loops)
> +	&& !hsa_gpu_implementation_p (fun->decl);
>      }
>  
>    virtual unsigned int execute (function *);
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index b80a8dd..366138c 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -75,6 +75,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-propagate.h"
>  #include "dbgcnt.h"
>  #include "tree-scalar-evolution.h"
> +#include "symbol-summary.h"
> +#include "hsa.h"
>  
>  
>  /* Loop or bb location.  */
> @@ -675,7 +677,10 @@ public:
>  
>    /* opt_pass methods: */
>    opt_pass * clone () { return new pass_slp_vectorize (m_ctxt); }
> -  virtual bool gate (function *) { return flag_tree_slp_vectorize != 0; }
> +  virtual bool gate (function *fun)
> +  {
> +    return flag_tree_slp_vectorize && !hsa_gpu_implementation_p (fun->decl);
> +  }
>    virtual unsigned int execute (function *);
>  
>  }; // class pass_slp_vectorize
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-05 21:58 ` [hsa 5/12] New HSA-related GCC options Martin Jambor
  2015-11-05 22:48   ` Joseph Myers
@ 2015-11-06  8:42   ` Richard Biener
  2015-11-09 16:59     ` Martin Jambor
  1 sibling, 1 reply; 44+ messages in thread
From: Richard Biener @ 2015-11-06  8:42 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches

On Thu, 5 Nov 2015, Martin Jambor wrote:

> Hi,
> 
> the following small part of the merge deals with new options.  It adds
> four independent things:
> 
> 1) flag_disable_hsa is used by code in opts.c (in the first patch) to
>    remember whether HSA has been explicitly disabled on the compiler
>    command line.

But I don't see any way to disable it on the command line?  (no switch?)

> 2) -Whsa is a new warning we emit whenever we fail to produce HSAIL
>    for some source code.  It is on by default but of course only
>    emitted by HSAIL generating code so should never affect anybody who
>    does not use an HSA-enabled compiler and OpenMP 4 device constructs.
> 
> We have found the following two additions very useful for debugging on
> the branch but will understand if they are not deemed suitable for
> trunk and will gladly remove them:
> 
> 3) -fdisable-hsa-gridification disables the gridification process to
>    ease experimenting with dynamic parallelism.  With this option,
>    HSAIL is always generated from the CPU-intended gimple.

So this sounds like something a user should never do, which means
it shouldn't be a switch (but a parameter or removed).

> 4) Parameter hsa-gen-debug-stores will be obsolete once HSA run-time
>    supports debugging traps.  Before that, we have to make do with
>    debugging stores to memory at defined places, which however can
>    cost speed in benchmarks.  So they are only enabled with this
>    parameter.  We decided to make it a parameter rather than a switch
>    to emphasize the fact it will go away and to possibly allow us to
>    select different levels of verbosity of the stores in the future.

You miss documentation in invoke.texi for new switches and parameters.

> Thanks,
> 
> Martin
> 
> 
> 2015-11-05  Martin Jambor  <mjambor@suse.cz>
> 
> 	* common.opt (disable_hsa): New variable.
> 	(-Whsa): New warning.
> 	(-fdisable-hsa-gridification): New option.
> 	* params.def (PARAM_HSA_GEN_DEBUG_STORES): New parameter.
> 
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 961a1b6..9cb52db 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -223,6 +223,10 @@ unsigned int flag_sanitize_recover = SANITIZE_UNDEFINED | SANITIZE_NONDEFAULT |
>  Variable
>  bool dump_base_name_prefixed = false
>  
> +; Flag whether HSA generation has been explicitly disabled
> +Variable
> +bool flag_disable_hsa = false
> +
>  ###
>  Driver
>  
> @@ -577,6 +581,10 @@ Wfree-nonheap-object
>  Common Var(warn_free_nonheap_object) Init(1) Warning
>  Warn when attempting to free a non-heap object.
>  
> +Whsa
> +Common Var(warn_hsa) Init(1) Warning
> +Warn when a function cannot be expanded to HSAIL.
> +
>  Winline
>  Common Var(warn_inline) Warning
>  Warn when an inlined function cannot be inlined.
> @@ -1107,6 +1115,10 @@ fdiagnostics-show-location=
>  Common Joined RejectNegative Enum(diagnostic_prefixing_rule)
>  -fdiagnostics-show-location=[once|every-line]	How often to emit source location at the beginning of line-wrapped diagnostics.
>  
> +fdisable-hsa-gridification
> +Common Report Var(flag_disable_hsa_gridification)
> +Disable HSA gridification for OMP pragmas
> +
>  ; Required for these enum values.
>  SourceInclude
>  pretty-print.h
> diff --git a/gcc/params.def b/gcc/params.def
> index c5d96e7..86911e2 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1177,6 +1177,11 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
>  	  "Maximum recursion depth allowed when querying a property of an"
>  	  " SSA name.",
>  	  2, 1, 0)
> +
> +DEFPARAM (PARAM_HSA_GEN_DEBUG_STORES,
> +	  "hsa-gen-debug-stores",
> +	  "Level of hsa debug stores verbosity",
> +	  0, 0, 1)
>  /*
>  
>  Local variables:
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 9/12] Small alloc-pool fix
  2015-11-05 22:03 ` [hsa 9/12] Small alloc-pool fix Martin Jambor
@ 2015-11-06  9:00   ` Richard Biener
  2015-11-06  9:52     ` Martin Liška
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Biener @ 2015-11-06  9:00 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Martin Liska

On Thu, 5 Nov 2015, Martin Jambor wrote:

> Hi,
> 
> we use C++ new operators based on alloc-pools a lot in the subsequent
> patches and realized that on the current trunk, such new operators
> would needlessly call the placement ::new operator within the allocate
> method of the pool allocator.  This is fixed below by providing a new
> allocation method which does not call placement new and which is
> therefore only safe to use from within a new operator.
> 
> The patch also fixes the slightly weird two-parameter operator new
> (which we do not use in the HSA backend) so that it does not do the
> same.

Why do you need to add the pointer variant then?

Also isn't the issue with allocate() that it does

    return ::new (m_allocator.allocate ()) T ();

which 1) value-initializes and 2) doesn't even work with types like

struct T { T(int); };

thus types without a default constructor.

I think the allocator was poorly C++-ified without updating the
specification for the cases it is supposed to handle.  And now
we have C++ uses that are not working because the allocator is
broken.

An incrementally better version (w/o fixing the issue with
types w/o default constructor) is

    return ::new (m_allocator.allocate ()) T;

thus default-initializes, doing no initialization for PODs (without
array members...), which is what the old pool allocator did.
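
(For anyone following along, a minimal sketch of the difference between
the two forms; this is illustrative code, not part of any patch:)

#include <new>

struct pod { int i; };  /* POD, no user-provided ctor */

static char buf[sizeof (pod)]
  __attribute__ ((aligned (__alignof__ (pod))));

void
demo ()
{
  /* Reusing the same storage is fine here, pod is trivially
     destructible.  */
  pod *p = ::new (buf) pod;     /* default-init: p->i left indeterminate */
  pod *q = ::new (buf) pod ();  /* value-init: q->i zero-initialized */
  (void) p; (void) q;
}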

To fix the new operator (how do you even call that?  does it allow
specifying constructor args and thus work without a default constructor?)
it should indeed use an allocation method not performing the placement
new.  But I'd call it allocate_raw rather than vallocate.

Thanks.
Richard.

> Thanks,
> 
> Martin
> 
> 
> 2015-11-05  Martin Liska  <mliska@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
> 
> 	* alloc-pool.h (object_allocator::vallocate): New method.
> 	(operator new): Call vallocate instead of allocate.
> 	(operator new): New operator.
> 
> 
> diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
> index 0dc05cd..46b6550 100644
> --- a/gcc/alloc-pool.h
> +++ b/gcc/alloc-pool.h
> @@ -483,6 +483,12 @@ public:
>      return ::new (m_allocator.allocate ()) T ();
>    }
>  
> +  inline void *
> +  vallocate () ATTRIBUTE_MALLOC
> +  {
> +    return m_allocator.allocate ();
> +  }
> +
>    inline void
>    remove (T *object)
>    {
> @@ -523,12 +529,19 @@ struct alloc_pool_descriptor
>  };
>  
>  /* Helper for classes that do not provide default ctor.  */
> -
>  template <typename T>
>  inline void *
>  operator new (size_t, object_allocator<T> &a)
>  {
> -  return a.allocate ();
> +  return a.vallocate ();
> +}
> +
> +/* Helper for classes that do not provide default ctor.  */
> +template <typename T>
> +inline void *
> +operator new (size_t, object_allocator<T> *a)
> +{
> +  return a->vallocate ();
>  }
>  
>  /* Hashtable mapping alloc_pool names to descriptors.  */
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 9/12] Small alloc-pool fix
  2015-11-06  9:00   ` Richard Biener
@ 2015-11-06  9:52     ` Martin Liška
  2015-11-06  9:57       ` Richard Biener
  0 siblings, 1 reply; 44+ messages in thread
From: Martin Liška @ 2015-11-06  9:52 UTC (permalink / raw)
  To: Richard Biener, Martin Jambor; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 4417 bytes --]

On 11/06/2015 10:00 AM, Richard Biener wrote:
> On Thu, 5 Nov 2015, Martin Jambor wrote:
> 
>> Hi,
>>
>> we use C++ new operators based on alloc-pools a lot in the subsequent
>> patches and realized that on the current trunk, such new operators
>> would needlessly call the placement ::new operator within the allocate
>> method of the pool allocator.  This is fixed below by providing a new
>> allocation method which does not call placement new and which is
>> therefore only safe to use from within a new operator.
>>
>> The patch also fixes the slightly weird two-parameter operator new
>> (which we do not use in the HSA backend) so that it does not do the
>> same.
> 

Hi.

> Why do you need to add the pointer variant then?

You are right, we originally used the variant in the branch, but it
eventually ended up unused.

> 
> Also isn't the issue with allocate() that it does
> 
>     return ::new (m_allocator.allocate ()) T ();
> 
> which 1) value-initializes and 2) doesn't even work with types like
> 
> struct T { T(int); };
> 
> thus types without a default constructor.

You are right, it produces a compilation error.

> 
> I think the allocator was poorly C++-ified without updating the
> specification for the cases it is supposed to handle.  And now
> we have C++ uses that are not working because the allocator is
> broken.
> 
> An incrementally better version (w/o fixing the issue with
> types w/o default constructor) is
> 
>     return ::new (m_allocator.allocate ()) T;

I've tried that, and it also calls default ctor:

../../gcc/alloc-pool.h: In instantiation of ‘T* object_allocator<T>::allocate() [with T = et_occ]’:
../../gcc/alloc-pool.h:531:22:   required from ‘void* operator new(size_t, object_allocator<T>&) [with T = et_occ; size_t = long unsigned int]’
../../gcc/et-forest.c:449:46:   required from here
../../gcc/et-forest.c:58:3: error: ‘et_occ::et_occ()’ is private
   et_occ ();
   ^
In file included from ../../gcc/et-forest.c:28:0:
../../gcc/alloc-pool.h:483:44: error: within this context
     return ::new (m_allocator.allocate ()) T;


> 
> thus default-initializes, doing no initialization for PODs (without
> array members...), which is what the old pool allocator did.

I'm not so familiar with differences related to PODs.

> 
> To fix the new operator (how do you even call that?  does it allow
> specifying constructor args and thus work without a default constructor?)
> it should indeed use an allocation method not performing the placement
> new.  But I'd call it allocate_raw rather than vallocate.

For situations where we do not have a default ctor, one should use the
helper method defined at the end of alloc-pool.h:

template <typename T>
inline void *
operator new (size_t, object_allocator<T> &a)
{
  return a.allocate ();
}

For instance:
et_occ *nw = new (et_occurrences) et_occ (2);

or as used in the HSA branch:

/* New operator to allocate convert instruction from pool alloc.  */

void *
hsa_insn_cvt::operator new (size_t)
{
  return hsa_allocp_inst_cvt->allocate_raw ();
}

and

cvtinsn = new hsa_insn_cvt (reg, *ptmp2);
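
Putting the pieces together, here is a minimal self-contained sketch of
the pattern (my_insn and my_pool are made-up names, not code from the
branch):

#include "alloc-pool.h"

struct my_insn;
static object_allocator<my_insn> *my_pool;

struct my_insn
{
  my_insn (int op) : m_op (op) {}  /* no default ctor needed */

  /* Hand out raw pool memory; the ctor invoked by the new-expression
     below is then the only initialization that runs.  */
  void *operator new (size_t)
  {
    return my_pool->allocate_raw ();
  }

  int m_op;
};

void
demo ()
{
  my_pool = new object_allocator<my_insn> ("my_insn pool");
  my_insn *insn = new my_insn (42);
  /* ... */
  my_pool->remove (insn);
}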


I attached a patch where I rename the method as suggested.

Thanks,
Martin

> 
> Thanks.
> Richard.
> 
>> Thanks,
>>
>> Martin
>>
>>
>> 2015-11-05  Martin Liska  <mliska@suse.cz>
>> 	    Martin Jambor  <mjambor@suse.cz>
>>
>> 	* alloc-pool.h (object_allocator::vallocate): New method.
>> 	(operator new): Call vallocate instead of allocate.
>> 	(operator new): New operator.
>>
>>
>> diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
>> index 0dc05cd..46b6550 100644
>> --- a/gcc/alloc-pool.h
>> +++ b/gcc/alloc-pool.h
>> @@ -483,6 +483,12 @@ public:
>>      return ::new (m_allocator.allocate ()) T ();
>>    }
>>  
>> +  inline void *
>> +  vallocate () ATTRIBUTE_MALLOC
>> +  {
>> +    return m_allocator.allocate ();
>> +  }
>> +
>>    inline void
>>    remove (T *object)
>>    {
>> @@ -523,12 +529,19 @@ struct alloc_pool_descriptor
>>  };
>>  
>>  /* Helper for classes that do not provide default ctor.  */
>> -
>>  template <typename T>
>>  inline void *
>>  operator new (size_t, object_allocator<T> &a)
>>  {
>> -  return a.allocate ();
>> +  return a.vallocate ();
>> +}
>> +
>> +/* Helper for classes that do not provide default ctor.  */
>> +template <typename T>
>> +inline void *
>> +operator new (size_t, object_allocator<T> *a)
>> +{
>> +  return a->vallocate ();
>>  }
>>  
>>  /* Hashtable mapping alloc_pool names to descriptors.  */
>>
>>
> 


[-- Attachment #2: alloc-pool.patch --]
[-- Type: text/x-patch, Size: 927 bytes --]

diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
index 0dc05cd..8b8c023 100644
--- a/gcc/alloc-pool.h
+++ b/gcc/alloc-pool.h
@@ -477,11 +477,22 @@ public:
     m_allocator.release_if_empty ();
   }
 
+  /* Allocate memory for instance of type T and call a default constructor.  */
+
   inline T *
   allocate () ATTRIBUTE_MALLOC
   {
     return ::new (m_allocator.allocate ()) T ();
   }
+  /* Allocate memory for instance of type T and return void * that
+     could be used in situations where a default constructor is not provided
+     by the class T.  */
+
+  inline void *
+  allocate_raw () ATTRIBUTE_MALLOC
+  {
+    return m_allocator.allocate ();
+  }
 
   inline void
   remove (T *object)
@@ -528,7 +539,7 @@ template <typename T>
 inline void *
 operator new (size_t, object_allocator<T> &a)
 {
-  return a.allocate ();
+  return a.allocate_raw ();
 }
 
 /* Hashtable mapping alloc_pool names to descriptors.  */

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 9/12] Small alloc-pool fix
  2015-11-06  9:52     ` Martin Liška
@ 2015-11-06  9:57       ` Richard Biener
  2015-11-10  8:48         ` Martin Liška
  0 siblings, 1 reply; 44+ messages in thread
From: Richard Biener @ 2015-11-06  9:57 UTC (permalink / raw)
  To: Martin Liška; +Cc: Martin Jambor, GCC Patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5084 bytes --]

On Fri, 6 Nov 2015, Martin Liška wrote:

> On 11/06/2015 10:00 AM, Richard Biener wrote:
> > On Thu, 5 Nov 2015, Martin Jambor wrote:
> > 
> >> Hi,
> >>
> >> we use C++ new operators based on alloc-pools a lot in the subsequent
> >> patches and realized that on the current trunk, such new operators
> >> would needlessly call the placement ::new operator within the allocate
> >> method of the pool allocator.  This is fixed below by providing a new
> >> allocation method which does not call placement new and which is
> >> therefore only safe to use from within a new operator.
> >>
> >> The patch also fixes the slightly weird two-parameter operator new
> >> (which we do not use in the HSA backend) so that it does not do the
> >> same.
> > 
> 
> Hi.
> 
> > Why do you need to add the pointer variant then?
> 
> You are right, we originally used the variant in the branch, but it
> eventually ended up unused.
> 
> > 
> > Also isn't the issue with allocate() that it does
> > 
> >     return ::new (m_allocator.allocate ()) T ();
> > 
> > which 1) value-initializes and 2) doesn't even work with types like
> > 
> > struct T { T(int); };
> > 
> > thus types without a default constructor.
> 
> You are right, it produces a compilation error.
> 
> > 
> > I think the allocator was poorly C++-ified without updating the
> > specification for the cases it is supposed to handle.  And now
> > we have C++ uses that are not working because the allocator is
> > broken.
> > 
> > An incrementally better version (w/o fixing the issue with
> > types w/o default constructor) is
> > 
> >     return ::new (m_allocator.allocate ()) T;
> 
> I've tried that, and it also calls default ctor:
> 
> ../../gcc/alloc-pool.h: In instantiation of ‘T* object_allocator<T>::allocate() [with T = et_occ]’:
> ../../gcc/alloc-pool.h:531:22:   required from ‘void* operator new(size_t, object_allocator<T>&) [with T = et_occ; size_t = long unsigned int]’
> ../../gcc/et-forest.c:449:46:   required from here
> ../../gcc/et-forest.c:58:3: error: ‘et_occ::et_occ()’ is private
>    et_occ ();
>    ^
> In file included from ../../gcc/et-forest.c:28:0:
> ../../gcc/alloc-pool.h:483:44: error: within this context
>      return ::new (m_allocator.allocate ()) T;

Yes, but it does slightly cheaper initialization of PODs

> 
> > 
> > thus default-initializes, doing no initialization for PODs (without
> > array members...), which is what the old pool allocator did.
> 
> I'm not so familiar with differences related to PODs.
> 
> > 
> > To fix the new operator (how do you even call that?  does it allow
> > specifying constructor args and thus work without a default constructor?)
> > it should indeed use an allocation method not performing the placement
> > new.  But I'd call it allocate_raw rather than vallocate.
> 
> For situations where we do not have a default ctor, one should use the
> helper method defined at the end of alloc-pool.h:
> 
> template <typename T>
> inline void *
> operator new (size_t, object_allocator<T> &a)
> {
>   return a.allocate ();
> }
> 
> For instance:
> et_occ *nw = new (et_occurrences) et_occ (2);

Oh, so it uses placement new syntax...  works for me.

> or as used in the HSA branch:
> 
> /* New operator to allocate convert instruction from pool alloc.  */
> 
> void *
> hsa_insn_cvt::operator new (size_t)
> {
>   return hsa_allocp_inst_cvt->allocate_raw ();
> }
> 
> and
> 
> cvtinsn = new hsa_insn_cvt (reg, *ptmp2);
> 
> 
> I attached a patch where I rename the method as suggested.

Ok.

Thanks,
Richard.

> Thanks,
> Martin
> 
> > 
> > Thanks.
> > Richard.
> > 
> >> Thanks,
> >>
> >> Martin
> >>
> >>
> >> 2015-11-05  Martin Liska  <mliska@suse.cz>
> >> 	    Martin Jambor  <mjambor@suse.cz>
> >>
> >> 	* alloc-pool.h (object_allocator::vallocate): New method.
> >> 	(operator new): Call vallocate instead of allocate.
> >> 	(operator new): New operator.
> >>
> >>
> >> diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
> >> index 0dc05cd..46b6550 100644
> >> --- a/gcc/alloc-pool.h
> >> +++ b/gcc/alloc-pool.h
> >> @@ -483,6 +483,12 @@ public:
> >>      return ::new (m_allocator.allocate ()) T ();
> >>    }
> >>  
> >> +  inline void *
> >> +  vallocate () ATTRIBUTE_MALLOC
> >> +  {
> >> +    return m_allocator.allocate ();
> >> +  }
> >> +
> >>    inline void
> >>    remove (T *object)
> >>    {
> >> @@ -523,12 +529,19 @@ struct alloc_pool_descriptor
> >>  };
> >>  
> >>  /* Helper for classes that do not provide default ctor.  */
> >> -
> >>  template <typename T>
> >>  inline void *
> >>  operator new (size_t, object_allocator<T> &a)
> >>  {
> >> -  return a.allocate ();
> >> +  return a.vallocate ();
> >> +}
> >> +
> >> +/* Helper for classes that do not provide default ctor.  */
> >> +template <typename T>
> >> +inline void *
> >> +operator new (size_t, object_allocator<T> *a)
> >> +{
> >> +  return a->vallocate ();
> >>  }
> >>  
> >>  /* Hashtable mapping alloc_pool names to descriptors.  */
> >>
> >>
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Merge of HSA branch
  2015-11-05 21:51 Merge of HSA branch Martin Jambor
                   ` (11 preceding siblings ...)
  2015-11-05 22:07 ` [hsa 12/12] HSA register allocator Martin Jambor
@ 2015-11-06 10:13 ` Bernd Schmidt
  2015-11-06 10:30   ` Richard Biener
  2015-11-06 10:54   ` Martin Liška
  12 siblings, 2 replies; 44+ messages in thread
From: Bernd Schmidt @ 2015-11-06 10:13 UTC (permalink / raw)
  To: GCC Patches, Jakub Jelinek, Richard Biener, Martin Liska, Michael Matz

On 11/05/2015 10:51 PM, Martin Jambor wrote:
> Individual changes are described in slightly more detail in their
> respective messages.  If you are interested in how the HSAIL
> generation works in general, I encourage you to have a look at my
> Cauldron slides or presentation, only very few things have changed as
> far as the general principles are concerned.  Let me just quickly stress
> here that we do acceleration within a single compiler, as opposed to
> LTO-ways of all the other accelerator teams.

Realistically we're probably not going to reject this work, but I still 
want to ask whether the approach was acked by the community before you 
started. I'm really not exactly thrilled about having two different 
classes of backends in the compiler, and two different ways of handling 
offloading.

> I also acknowledge that we should add HSA-specific tests to the GCC
> testsuite but we are only now looking at how to do that and will
> welcome any guidance in this regard.

Yeah, I was looking for any kind of new test, because...

> the class of OpenMP loops we can handle well is small,

I'd appreciate more information on what this means. Any examples or 
performance numbers?


Bernd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Merge of HSA branch
  2015-11-06 10:13 ` Merge of HSA branch Bernd Schmidt
@ 2015-11-06 10:30   ` Richard Biener
  2015-11-06 11:03     ` Bernd Schmidt
  2015-11-06 10:54   ` Martin Liška
  1 sibling, 1 reply; 44+ messages in thread
From: Richard Biener @ 2015-11-06 10:30 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Jakub Jelinek, Martin Liska, Michael Matz

On Fri, 6 Nov 2015, Bernd Schmidt wrote:

> On 11/05/2015 10:51 PM, Martin Jambor wrote:
> > Individual changes are described in slightly more detail in their
> > respective messages.  If you are interested in how the HSAIL
> > generation works in general, I encourage you to have a look at my
> > Cauldron slides or presentation, only very few things have changed as
> > far as the general principles are concerned.  Let me just quickly stress
> > here that we do acceleration within a single compiler, as opposed to
> > LTO-ways of all the other accelerator teams.
> 
> Realistically we're probably not going to reject this work, but I still want
> to ask whether the approach was acked by the community before you started. I'm
> really not exactly thrilled about having two different classes of backends in
> the compiler, and two different ways of handling offloading.

Realistically the other approaches weren't acked either (well, implicitly
by review).  Not doing an RTL backend for NVPTX would have simplified
your life as well.  Not doing an RTL backend practically means not
going the LTO way as you couldn't easily even build a target without
RTL pieces (not sure how big a "dummy" RTL target would be).

Richard.

> > I also acknowledge that we should add HSA-specific tests to the GCC
> > testsuite but we are only now looking at how to do that and will
> > welcome any guidance in this regard.
> 
> Yeah, I was looking for any kind of new test, because...
>
> > the class of OpenMP loops we can handle well is small,
> 
> I'd appreciate more information on what this means. Any examples or
> performance numbers?
> 
> 
> Bernd
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Merge of HSA branch
  2015-11-06 10:13 ` Merge of HSA branch Bernd Schmidt
  2015-11-06 10:30   ` Richard Biener
@ 2015-11-06 10:54   ` Martin Liška
  1 sibling, 0 replies; 44+ messages in thread
From: Martin Liška @ 2015-11-06 10:54 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt

On 11/06/2015 11:12 AM, Bernd Schmidt wrote:
> On 11/05/2015 10:51 PM, Martin Jambor wrote:
>> Individual changes are described in slightly more detail in their
>> respective messages.  If you are interested in how the HSAIL
>> generation works in general, I encourage you to have a look at my
>> Cauldron slides or presentation, only very few things have changed as
>> far as the general principles are concerned.  Let me just quickly stress
>> here that we do acceleration within a single compiler, as opposed to
>> LTO-ways of all the other accelerator teams.
> 
> Realistically we're probably not going to reject this work, but I still want to ask whether the approach was acked by the community before you started. I'm really not exactly thrilled about having two different classes of backends in the compiler, and two different ways of handling offloading.
> 
>> I also acknowledge that we should add HSA-specific tests to the GCC
>> testsuite but we are only now looking at how to do that and will
>> welcome any guidance in this regard.
> 
> Yeah, I was looking for any kind of new test, because...
> 
>> the class of OpenMP loops we can handle well is small,
> 
> I'd appreciate more information on what this means. Any examples or performance numbers?

Hello.

As mentioned by Martin Jambor, this was explained during his talk at the
Cauldron this year.  It can be easily illustrated with the following
simple case:

#pragma omp target teams
#pragma omp distribute parallel for private(j)
   for (j=0; j<N; j++)
      c[j] = a[j];

This is a simple vector copy, which is going to be transformed to:

_4 = omp_data.i_1(D).D.5301 (iteration space)
_5 = __builtin_omp_get_num_threads ();
_6 = __builtin_omp_get_thread_num ();
_7 = calculate_chunk_start (_4, _5, _6); // pseudocode
_8 = calculate_chunk_end (_4, _5, _6); // pseudocode

for(i = _7; i < _8; i++)
  dest[i] = src[i];

and such a kernel is dispatched with the default grid size (in our case 64),
so that every work item handles a chunk of size N/64.
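
(For concreteness, the chunk computation hidden behind the
calculate_chunk_* pseudocode above corresponds roughly to the following;
this is a sketch, not the actual generated code:)

int nthreads = __builtin_omp_get_num_threads ();
int tid = __builtin_omp_get_thread_num ();
int chunk = (N + nthreads - 1) / nthreads;  /* ceiling division */
int start = tid * chunk;
int end = start + chunk < N ? start + chunk : N;

for (i = start; i < end; i++)
  c[i] = a[i];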

On the other hand, gridification is going to transform it to:

_7 = __builtin_omp_get_thread_num ();
dest[_7] = src[_7];

and the kernel is offloaded like this:
HSA debug: GOMP_OFFLOAD_run called with grid size 10000000 and group size 0

Performance numbers show order-of-magnitude differences; see slides 27-30 in [1]

Martin

[1] https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=get&target=mjambor-hsa-slides.pdf

> 
> 
> Bernd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Merge of HSA branch
  2015-11-06 10:30   ` Richard Biener
@ 2015-11-06 11:03     ` Bernd Schmidt
  2015-11-06 11:33       ` Thomas Schwinge
  0 siblings, 1 reply; 44+ messages in thread
From: Bernd Schmidt @ 2015-11-06 11:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Martin Liska, Michael Matz

On 11/06/2015 11:30 AM, Richard Biener wrote:
> On Fri, 6 Nov 2015, Bernd Schmidt wrote:
>>
>> Realistically we're probably not going to reject this work, but I still want
>> to ask whether the approach was acked by the community before you started. I'm
>> really not exactly thrilled about having two different classes of backends in
>> the compiler, and two different ways of handling offloading.
>
> Realistically the other approaches weren't acked either (well, implicitly
> by review).

I think the LTO approach was discussed beforehand. As far as I remember 
(and Jakub may correct me) it was considered for intelmic, and Jakub had 
considerable input on it. I heard that it came up at the 2013 Cauldron.
Writing an rtl backend is the default thing to do for gcc and I would 
expect any other approach to be discussed beforehand.

> Not doing an RTL backend for NVPTX would have simplified
> your life as well.

I'm not convinced about this. At least I just had to turn off the 
register allocator, not write a new one.


Bernd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue)
  2015-11-05 22:05 ` [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue) Martin Jambor
@ 2015-11-06 11:29   ` Bernd Schmidt
  2015-11-06 12:45     ` Bernd Schmidt
  0 siblings, 1 reply; 44+ messages in thread
From: Bernd Schmidt @ 2015-11-06 11:29 UTC (permalink / raw)
  To: GCC Patches, David Edelsohn

On 11/05/2015 11:05 PM, Martin Jambor wrote:
> Initially, I created the file by copying out pieces of the PDF
> documentation, but the latest version of the file (describing final
> HSAIL 1.0) is actually taken from the HSAIL (dis)assembler developed
> by the HSA Foundation and released under the "University of
> Illinois/NCSA Open Source License."
>
> The license is "GPL-compatible" according to FSF
> (http://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses)
> so I believe that means we can put it inside GCC and I hope I also do
> not need any special steering committee approval or whatnot.  At the
> same time, the license comes with three restrictions that I hope I
> have fulfilled by keeping them in the header comment.  Nevertheless,
> if anybody knowledgeable can tell me what is the known right thing to
> do (or to confirm this is indeed the right thing to do), I'll be very
> happy.

It's not something I as a reviewer would want to decide, so I think this
really is a question for the Steering Committee - they might not know
the answer either, but they can ask the FSF.

David Cc'ed so he can take the necessary steps.


Bernd

> +/* HSAIL and BRIG related macros and definitions.
> +   Copyright (c) 2013-2015, Advanced Micro Devices, Inc.
> +   Copyright (C) 2013-2015 Free Software Foundation, Inc.
> +
> +   Majority of contents in this file has originally been distributed under the
> +   University of Illinois/NCSA Open Source License.  This license mandates that
> +   the following conditions are observed when distributing this file:
> +
> +     * Redistributions of source code must retain the above copyright notice,
> +       this list of conditions and the following disclaimers.
> +
> +     * Redistributions in binary form must reproduce the above copyright notice,
> +       this list of conditions and the following disclaimers in the
> +       documentation and/or other materials provided with the distribution.
> +
> +     * Neither the names of the HSA Team, HSA Foundation, University of
> +       Illinois at Urbana-Champaign, nor the names of its contributors may be
> +       used to endorse or promote products derived from this Software without
> +       specific prior written permission.
> +
> +   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> +   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> +   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
> +   CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> +   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> +   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> +   WITH THE SOFTWARE.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Merge of HSA branch
  2015-11-06 11:03     ` Bernd Schmidt
@ 2015-11-06 11:33       ` Thomas Schwinge
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Schwinge @ 2015-11-06 11:33 UTC (permalink / raw)
  To: Bernd Schmidt, Richard Biener
  Cc: GCC Patches, Jakub Jelinek, Martin Liska, Michael Matz, Torvald Riegel

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]

Hi!

On Fri, 6 Nov 2015 12:03:25 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 11/06/2015 11:30 AM, Richard Biener wrote:
> > On Fri, 6 Nov 2015, Bernd Schmidt wrote:
> >>
> >> Realistically we're probably not going to reject this work, but I still want
> >> to ask whether the approach was acked by the community before you started. I'm
> >> really not exactly thrilled about having two different classes of backends in
> >> the compiler, and two different ways of handling offloading.
> >
> > Realistically the other approaches weren't acked either (well, implicitly
> > by review).
> 
> I think the LTO approach was discussed beforehand. As far as I remember 
> (and Jakub may correct me) it was considered for intelmic, and Jakub had 
> considerable input on it. I heard that it came up at the 2013 Cauldron.
> Writing an rtl backend is the default thing to do for gcc and I would 
> expect any other approach to be discussed beforehand.
> 
> > Not doing an RTL backend for NVPTX would have simplified
> > your life as well.
> 
> I'm not convinced about this. At least I just had to turn off the 
> register allocator, not write a new one.

From the notes of the Accelerator BoF at the GNU Tools Cauldron 2013,
<http://news.gmane.org/find-root.php?message_id=%3C1375103926.7129.7694.camel%40triegel.csb%3E>:

| The main issue we discussed in the backend category was how to target
| more than one ISA when generating code (i.e., we need code in the host's
| ISA and in the accelerator(s)' (virtual) ISA(s)).  Multi-target support
| in GCC might be one option, but would probably need quite some time and
| thus depending on it would probably delay the accelerator efforts.  It
| might be simpler to stream code several times to different backends
| using the LTO infrastructure.  [...] A third
| option that SuSE is experimenting with is not writing a new backend but
| instead generating code right after the last GIMPLE pass; however, HSAIL
| needs register allocation, so it was noted that writing a light-weight
| backend might be
| easier.


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue)
  2015-11-06 11:29   ` Bernd Schmidt
@ 2015-11-06 12:45     ` Bernd Schmidt
  0 siblings, 0 replies; 44+ messages in thread
From: Bernd Schmidt @ 2015-11-06 12:45 UTC (permalink / raw)
  To: GCC Patches, David Edelsohn

On 11/06/2015 12:29 PM, Bernd Schmidt wrote:
> David Cc'ed so he can take the necessary steps.
>

> Initially, I created the file by copying out pieces of the PDF
> documentation, but the latest version of the file (describing final
> HSAIL 1.0) is actually taken from the HSAIL (dis)assembler developed
> by the HSA Foundation and released under the "University of
> Illinois/NCSA Open Source License."

Actually there's not just the question of license, but also of copyright 
assignment.


Bernd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 4/12] OpenMP lowering/expansion changes (gridification)
  2015-11-05 21:57 ` [hsa 4/12] OpenMP lowering/expansion changes (gridification) Martin Jambor
@ 2015-11-09 10:02   ` Martin Jambor
  2015-11-12 11:16   ` Jakub Jelinek
  1 sibling, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-09 10:02 UTC (permalink / raw)
  To: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

Hi,

On Thu, Nov 05, 2015 at 10:57:33PM +0100, Martin Jambor wrote:
> 
 ... 
> 
> For convenience of anybody reviewing the code, I'm attaching a very
> simple testcase with selection of dumps that illustrate the whole
> process.
> 

My apologies, I forgot to attach the file, so let me quickly correct
that now.  The tar file contains the source and a selection of dumps
generated by a compilation with the "-fopenmp -O -S -fdump-tree-all
-fdump-tree-omplower-details" flags.

Thanks,

Martin

[-- Attachment #2: plusone.tgz --]
[-- Type: application/x-compressed-tar, Size: 6590 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 1/12] Configuration and offloading-related changes
  2015-11-05 22:47   ` Joseph Myers
@ 2015-11-09 16:57     ` Martin Jambor
  0 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-09 16:57 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, Jakub Jelinek

On Thu, Nov 05, 2015 at 10:47:15PM +0000, Joseph Myers wrote:
> On Thu, 5 Nov 2015, Martin Jambor wrote:
> 
> > libgomp plugin to be built.  Because the plugin needs to use HSA
> > run-time library, I have introduced options --with-hsa-runtime (and
> > more precise --with-hsa-include and --with-hsa-lib) to help find it.
> 
> New configure options should be documented in install.texi.

Right, I am about to commit the following patch to the branch (I have
checked the make info output and it looks good).

Thanks for the reminder.

Martin

[hsa] Document configuration changes

2015-11-09  Martin Jambor  <mjambor@suse.cz>

	* install.texi (Configuration): Describe hsa --enable-offload-targets
	option, add description of --with-hsa-runtime,
	--with-hsa-runtime-include and --with-hsa-runtime-lib

diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 57399ed..6984c40 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -1982,6 +1982,22 @@ specifying paths @var{path1}, @dots{}, @var{pathN}.
 % @var{srcdir}/configure \
     --enable-offload-target=i686-unknown-linux-gnu=/path/to/i686/compiler,x86_64-pc-linux-gnu
 @end smallexample
+
+If @samp{hsa} is specified as one of the targets, the compiler will be
+built with support for HSA GPU accelerators.  Because the same
+compiler will emit the accelerator code, no path should be specified.
+
+@item --with-hsa-runtime=@var{pathname}
+@itemx --with-hsa-runtime-include=@var{pathname}
+@itemx --with-hsa-runtime-lib=@var{pathname}
+
+If you configure GCC with HSA offloading but do not have the HSA
+run-time library installed in a standard location, then you can
+explicitly specify the directory where it is installed.  The
+@option{--with-hsa-runtime=@/@var{hsainstalldir}} option is a shorthand for
+@option{--with-hsa-runtime-lib=@/@var{hsainstalldir}/lib} and
+@option{--with-hsa-runtime-include=@/@var{hsainstalldir}/include}.
+
 @end table
 
 @subheading Cross-Compiler-Specific Options
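
For reference, a configure invocation using the options documented above
might look like this (the HSA runtime install directory is made up):

  % srcdir/configure --enable-offload-targets=hsa \
      --with-hsa-runtime=/opt/hsa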

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 3/12] HSA libgomp plugin
  2015-11-05 22:47   ` Joseph Myers
@ 2015-11-09 16:58     ` Martin Jambor
  0 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-09 16:58 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, Jakub Jelinek

Hi,

On Thu, Nov 05, 2015 at 10:47:44PM +0000, Joseph Myers wrote:
> This new file should have the standard libgomp copyright / license notice.
> 

Oops, thanks for pointing this out.  I am about to commit the
following remedy to the branch.

Thanks,

Martin


2015-11-09  Martin Jambor  <mjambor@suse.cz>

	* plugin-hsa.c: Add the standard copyright header.

diff --git a/libgomp/plugin/plugin-hsa.c b/libgomp/plugin/plugin-hsa.c
index c1b7879..470b892 100644
--- a/libgomp/plugin/plugin-hsa.c
+++ b/libgomp/plugin/plugin-hsa.c
@@ -1,3 +1,32 @@
+/* Plugin for HSAIL execution.
+
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+
+   Contributed by Martin Jambor <mjambor@suse.cz> and
+   Martin Liska <mliska@suse.cz>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-06  8:42   ` Richard Biener
@ 2015-11-09 16:59     ` Martin Jambor
  2015-11-10  9:01       ` Richard Biener
  2015-11-12 11:19       ` Jakub Jelinek
  0 siblings, 2 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-09 16:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Hi,

On Fri, Nov 06, 2015 at 09:42:25AM +0100, Richard Biener wrote:
> On Thu, 5 Nov 2015, Martin Jambor wrote:
> 
> > Hi,
> > 
> > the following small part of the merge deals with new options.  It adds
> > four independent things:
> > 
> > 1) flag_disable_hsa is used by code in opts.c (in the first patch) to
> >    remember whether HSA has been explicitly disabled on the compiler
> >    command line.
> 
> But I don't see any way to disable it on the command line?  (no switch?)

No, the switch is -foffload, whose documentation is missing (PR
67300); it is only described at https://gcc.gnu.org/wiki/Offloading.
Nevertheless, the option allows the user to specify the compiler option
-foffload=disable, in which case no offloading should happen, not even
HSA.  The user can also enumerate just the offload targets they want
(and pass them special command line options).
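
For example (illustrative command lines; the second one assumes hsa is
among the configured offload targets):

  gcc -fopenmp -foffload=disable test.c   # no offloading at all, not even HSA
  gcc -fopenmp -foffload=hsa test.c       # offload only to HSA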

It seems I have misplaced a hunk in the patch series.  Nevertheless,
in the first patch (with configuration stuff), there is a change to
opts.c which scans the -foffload= contents and sets the flag variable
if hsa is not present.

Whenever the compiler has to decide whether HSA is enabled for the
given compilation or not, it has to look at this variable (if
configured for HSA).

> 
> > 2) -Whsa is a new warning we emit whenever we fail to produce HSAIL
> >    for some source code.  It is on by default but of course only
> >    emitted by HSAIL generating code so should never affect anybody who
> >    does not use HSA-enabled compiler and OpenMP 4 device constructs.
> > 
> > We have found the following two additions very useful for debugging on
> > the branch but will understand if they are not deemed suitable for
> > trunk and will gladly remove them:
> > 
> > 3) -fdisable-hsa-gridification disables the gridification process to
> >    ease experimenting with dynamic parallelism.  With this option,
> >    HSAIL is always generated from the CPU-intended gimple.
> 
> So this sounds like something a user should never do, which means
> it shouldn't be a switch (but a parameter, or removed).

Martin said he likes the capability to switch gridification off, so I
turned it into a parameter.

> 
> > 4) Parameter hsa-gen-debug-stores will be obsolete once HSA run-time
> >    supports debugging traps.  Until then, we have to make do with
> >    debugging stores to memory at defined places, which however can
> >    cost speed in benchmarks.  So they are only enabled with this
> >    parameter.  We decided to make it a parameter rather than a switch
> >    to emphasize the fact that it will go away and to possibly allow us
> >    to select different levels of verbosity of the stores in the future.
> 
> You are missing documentation in invoke.texi for the new switches and parameters.

Right, I have added that together with other changes addressing the
above comments and am about to commit the following to the branch:


2015-11-09  Martin Jambor  <mjambor@suse.cz>

	* common.opt (-fdisable-hsa-gridification): Removed.
	* params.def (PARAM_OMP_GPU_GRIDIFY): New.
	* omp-low.c: Include params.h.
	(execute_lower_omp): Check parameter PARAM_OMP_GPU_GRIDIFY instead of
	flag_disable_hsa_gridification.
	* doc/invoke.texi (Optimize Options): Add description of
	omp-gpu-gridify and hsa-gen-debug-stores parameters.

diff --git a/gcc/common.opt b/gcc/common.opt
index 9cb52db..8bee504 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1115,10 +1115,6 @@ fdiagnostics-show-location=
 Common Joined RejectNegative Enum(diagnostic_prefixing_rule)
 -fdiagnostics-show-location=[once|every-line]	How often to emit source location at the beginning of line-wrapped diagnostics.
 
-fdisable-hsa-gridification
-Common Report Var(flag_disable_hsa_gridification)
-Disable HSA gridification for OMP pragmas
-
 ; Required for these enum values.
 SourceInclude
 pretty-print.h
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4fc7d88..b9fb1e1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -11171,6 +11171,17 @@ dynamic, guided, auto, runtime).  The default is static.
 Maximum depth of recursion when querying properties of SSA names in things
 like fold routines.  One level of recursion corresponds to following a
 use-def chain.
+
+@item omp-gpu-gridify
+Enable creation of gridified GPU kernels out of loops within target
+OpenMP constructs.  This conversion is enabled by default when
+offloading to HSA; to disable it, use @option{--param omp-gpu-gridify=0}.
+
+@item hsa-gen-debug-stores
+Enable emission of special debug stores within HSA kernels which are
+then read and reported by the libgomp plugin.  Generation of these
+stores is disabled by default; use @option{--param hsa-gen-debug-stores=1}
+to enable it.
 @end table
 @end table
 
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 34aafc8..f90a698 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -82,6 +82,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "symbol-summary.h"
 #include "hsa.h"
+#include "params.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -17449,7 +17450,8 @@ execute_lower_omp (void)
 
   body = gimple_body (current_function_decl);
 
-  if (hsa_gen_requested_p () && !flag_disable_hsa_gridification)
+  if (hsa_gen_requested_p ()
+      && PARAM_VALUE (PARAM_OMP_GPU_GRIDIFY) == 1)
     create_target_gpukernels (&body);
 
   scan_omp (&body, NULL);
diff --git a/gcc/params.def b/gcc/params.def
index 86911e2..f12755b 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1178,6 +1178,12 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
 	  " SSA name.",
 	  2, 1, 0)
 
+DEFPARAM (PARAM_OMP_GPU_GRIDIFY,
+	  "omp-gpu-gridify",
+	  "Enable creation of gridified GPU kernels out of OpenMP target "
+	  "constructs",
+	  1, 0, 1)
+
 DEFPARAM (PARAM_HSA_GEN_DEBUG_STORES,
 	  "hsa-gen-debug-stores",
 	  "Level of hsa debug stores verbosity",

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 9/12] Small alloc-pool fix
  2015-11-06  9:57       ` Richard Biener
@ 2015-11-10  8:48         ` Martin Liška
  2015-11-10 10:07           ` Richard Biener
  0 siblings, 1 reply; 44+ messages in thread
From: Martin Liška @ 2015-11-10  8:48 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 5201 bytes --]

On 11/06/2015 10:57 AM, Richard Biener wrote:
> On Fri, 6 Nov 2015, Martin Liška wrote:
> 
>> On 11/06/2015 10:00 AM, Richard Biener wrote:
>>> On Thu, 5 Nov 2015, Martin Jambor wrote:
>>>
>>>> Hi,
>>>>
>>>> we use C++ new operators based on alloc-pools a lot in the subsequent
>>>> patches and realized that on the current trunk, such new operators
>>>> would needlessly call the placement ::new operator within the allocate
>>>> method of the pool allocator.  This is fixed below by providing a new
>>>> allocation method which does not call placement new and which is
>>>> therefore only safe to use from within a new operator.
>>>>
>>>> The patch also fixes the slightly weird two-parameter operator new
>>>> (which we do not use in the HSA backend) so that it does not do the
>>>> same.
>>>
>>
>> Hi.
>>
>>> Why do you need to add the pointer variant then?
>>
>> You are right, we originally used the variant in the branch, but it
>> eventually ended up unused.
>>
>>>
>>> Also isn't the issue with allocate() that it does
>>>
>>>     return ::new (m_allocator.allocate ()) T ();
>>>
>>> which 1) value-initializes and 2) doesn't even work with types like
>>>
>>> struct T { T(int); };
>>>
>>> thus types without a default constructor.
>>
>> You are right, it produces a compilation error.
>>
>>>
>>> I think the allocator was poorly C++-ified without updating the
>>> specification for the cases it is supposed to handle.  And now
>>> we have C++ uses that are not working because the allocator is
>>> broken.
>>>
>>> An incrementally better version (w/o fixing the issue with
>>> types w/o default constructor) is
>>>
>>>     return ::new (m_allocator.allocate ()) T;
>>
>> I've tried that, and it also calls default ctor:
>>
>> ../../gcc/alloc-pool.h: In instantiation of ‘T* object_allocator<T>::allocate() [with T = et_occ]’:
>> ../../gcc/alloc-pool.h:531:22:   required from ‘void* operator new(size_t, object_allocator<T>&) [with T = et_occ; size_t = long unsigned int]’
>> ../../gcc/et-forest.c:449:46:   required from here
>> ../../gcc/et-forest.c:58:3: error: ‘et_occ::et_occ()’ is private
>>    et_occ ();
>>    ^
>> In file included from ../../gcc/et-forest.c:28:0:
>> ../../gcc/alloc-pool.h:483:44: error: within this context
>>      return ::new (m_allocator.allocate ()) T;
> 
> Yes, but it does slightly cheaper initialization of PODs
> 
>>
>>>
>>> thus default-initializes, doing no initialization for PODs (without
>>> array members...), which is what the old pool allocator did.
>>
>> I'm not so familiar with differences related to PODs.
>>
>>>
>>> To fix the new operator (how do you even call that?  does it allow
>>> specifying constructor args and thus work without a default constructor?)
>>> it should indeed use an allocation method not performing the placement
>>> new.  But I'd call it allocate_raw rather than vallocate.
>>
>> For situations where we do not have a default ctor, one should use the
>> helper method defined at the end of alloc-pool.h:
>>
>> template <typename T>
>> inline void *
>> operator new (size_t, object_allocator<T> &a)
>> {
>>   return a.allocate ();
>> }
>>
>> For instance:
>> et_occ *nw = new (et_occurrences) et_occ (2);
> 
> Oh, so it uses placement new syntax...  works for me.
> 
>> or as used in the HSA branch:
>>
>> /* New operator to allocate convert instruction from pool alloc.  */
>>
>> void *
>> hsa_insn_cvt::operator new (size_t)
>> {
>>   return hsa_allocp_inst_cvt->allocate_raw ();
>> }
>>
>> and
>>
>> cvtinsn = new hsa_insn_cvt (reg, *ptmp2);
>>
>>
>> I attached a patch where I rename the method as suggested.
> 
> Ok.

Hi.

I'm sending the suggested patch, which survives regression tests and
bootstrap on x86_64-linux-gnu.

Can I install the patch to trunk?
Thanks,
Martin

> 
> Thanks,
> Richard.
> 
>> Thanks,
>> Martin
>>
>>>
>>> Thanks.
>>> Richard.
>>>
>>>> Thanks,
>>>>
>>>> Martin
>>>>
>>>>
>>>> 2015-11-05  Martin Liska  <mliska@suse.cz>
>>>> 	    Martin Jambor  <mjambor@suse.cz>
>>>>
>>>> 	* alloc-pool.h (object_allocator::vallocate): New method.
>>>> 	(operator new): Call vallocate instead of allocate.
>>>> 	(operator new): New operator.
>>>>
>>>>
>>>> diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
>>>> index 0dc05cd..46b6550 100644
>>>> --- a/gcc/alloc-pool.h
>>>> +++ b/gcc/alloc-pool.h
>>>> @@ -483,6 +483,12 @@ public:
>>>>      return ::new (m_allocator.allocate ()) T ();
>>>>    }
>>>>  
>>>> +  inline void *
>>>> +  vallocate () ATTRIBUTE_MALLOC
>>>> +  {
>>>> +    return m_allocator.allocate ();
>>>> +  }
>>>> +
>>>>    inline void
>>>>    remove (T *object)
>>>>    {
>>>> @@ -523,12 +529,19 @@ struct alloc_pool_descriptor
>>>>  };
>>>>  
>>>>  /* Helper for classes that do not provide default ctor.  */
>>>> -
>>>>  template <typename T>
>>>>  inline void *
>>>>  operator new (size_t, object_allocator<T> &a)
>>>>  {
>>>> -  return a.allocate ();
>>>> +  return a.vallocate ();
>>>> +}
>>>> +
>>>> +/* Helper for classes that do not provide default ctor.  */
>>>> +template <typename T>
>>>> +inline void *
>>>> +operator new (size_t, object_allocator<T> *a)
>>>> +{
>>>> +  return a->vallocate ();
>>>>  }
>>>>  
>>>>  /* Hashtable mapping alloc_pool names to descriptors.  */
>>>>
>>>>
>>>
>>
>>
> 


[-- Attachment #2: 0001-Enhance-pool-allocator.patch --]
[-- Type: text/x-patch, Size: 1430 bytes --]

From acccc1c0f4bfe38b14c7dcc2c278c63b6484e91b Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Mon, 9 Nov 2015 16:52:21 +0100
Subject: [PATCH] Enhance pool allocator

gcc/ChangeLog:

2015-11-09  Martin Liska  <mliska@suse.cz>

	* alloc-pool.h (allocate_raw): New function.
	(operator new (size_t, object_allocator<T> &a)): Use the
	function instead of object_allocator::allocate).
---
 gcc/alloc-pool.h | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
index bf9b0eb..38aff28 100644
--- a/gcc/alloc-pool.h
+++ b/gcc/alloc-pool.h
@@ -477,12 +477,25 @@ public:
     m_allocator.release_if_empty ();
   }
 
+
+  /* Allocate memory for instance of type T and call a default constructor.  */
+
   inline T *
   allocate () ATTRIBUTE_MALLOC
   {
     return ::new (m_allocator.allocate ()) T;
   }
 
+  /* Allocate memory for instance of type T and return void * that
+     could be used in situations where a default constructor is not provided
+     by the class T.  */
+
+  inline void *
+  allocate_raw () ATTRIBUTE_MALLOC
+  {
+    return m_allocator.allocate ();
+  }
+
   inline void
   remove (T *object)
   {
@@ -528,7 +541,7 @@ template <typename T>
 inline void *
 operator new (size_t, object_allocator<T> &a)
 {
-  return a.allocate ();
+  return a.allocate_raw ();
 }
 
 /* Hashtable mapping alloc_pool names to descriptors.  */
-- 
2.6.2


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-09 16:59     ` Martin Jambor
@ 2015-11-10  9:01       ` Richard Biener
  2015-11-12 11:19       ` Jakub Jelinek
  1 sibling, 0 replies; 44+ messages in thread
From: Richard Biener @ 2015-11-10  9:01 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches

On Mon, 9 Nov 2015, Martin Jambor wrote:

> Hi,
> 
> On Fri, Nov 06, 2015 at 09:42:25AM +0100, Richard Biener wrote:
> > On Thu, 5 Nov 2015, Martin Jambor wrote:
> > 
> > > Hi,
> > > 
> > > the following small part of the merge deals with new options.  It adds
> > > four independent things:
> > > 
> > > 1) flag_disable_hsa is used by code in opts.c (in the first patch) to
> > >    remember whether HSA has been explicitly disabled on the compiler
> > >    command line.
> > 
> > But I don't see any way to disable it on the command line?  (no switch?)
> 
> No, the switch is -foffload, whose documentation is missing (PR
> 67300); it is only described at https://gcc.gnu.org/wiki/Offloading.
> Nevertheless, the option allows the user to specify the compiler option
> -foffload=disable, in which case no offloading should happen, not even
> HSA.  The user can also enumerate just the offload targets they want
> (and pass them special command line options).
> 
> It seems I have misplaced a hunk in the patch series.  Nevertheless,
> in the first patch (with configuration stuff), there is a change to
> opts.c which scans the -foffload= contents and sets the flag variable
> if hsa is not present.
> 
> Whenever the compiler has to decide whether HSA is enabled for the
> given compilation or not, it has to look at this variable (if
> configured for HSA).
> 
> > 
> > > 2) -Whsa is a new warning we emit whenever we fail to produce HSAIL
> > >    for some source code.  It is on by default but of course only
> > >    emitted by HSAIL generating code so should never affect anybody who
> > >    does not use HSA-enabled compiler and OpenMP 4 device constructs.
> > > 
> > > We have found the following two additions very useful for debugging on
> > > the branch but will understand if they are not deemed suitable for
> > > trunk and will gladly remove them:
> > > 
> > > 3) -fdisable-hsa-gridification disables the gridification process to
> > >    ease experimenting with dynamic parallelism.  With this option,
> > >    HSAIL is always generated from the CPU-intended gimple.
> > 
> > So this sounds like something a user should never do, which means
> > it shouldn't be a switch (but a parameter, or removed).
> 
> Martin said he likes the capability to switch gridification off, so I
> turned it into a parameter.
> 
> > 
> > > 4) Parameter hsa-gen-debug-stores will be obsolete once HSA run-time
> > >    supports debugging traps.  Until then, we have to make do with
> > >    debugging stores to memory at defined places, which however can
> > >    cost speed in benchmarks.  So they are only enabled with this
> > >    parameter.  We decided to make it a parameter rather than a switch
> > >    to emphasize the fact that it will go away and to possibly allow us
> > >    to select different levels of verbosity of the stores in the future.
> > 
> > You are missing documentation in invoke.texi for the new switches and parameters.
> 
> Right, I have added that together with other changes addressing the
> above comments and am about to commit the following to the branch:

Looks good to me.

Thanks,
Richard.

> 
> 2015-11-09  Martin Jambor  <mjambor@suse.cz>
> 
> 	* common.opt (-fdisable-hsa-gridification): Removed.
> 	* params.def (PARAM_OMP_GPU_GRIDIFY): New.
> 	* omp-low.c: Include params.h.
> 	(execute_lower_omp): Check parameter PARAM_OMP_GPU_GRIDIFY instead of
> 	flag_disable_hsa_gridification.
> 	* doc/invoke.texi (Optimize Options): Add description of
> 	omp-gpu-gridify and hsa-gen-debug-stores parameters.
> 
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 9cb52db..8bee504 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1115,10 +1115,6 @@ fdiagnostics-show-location=
>  Common Joined RejectNegative Enum(diagnostic_prefixing_rule)
>  -fdiagnostics-show-location=[once|every-line]	How often to emit source location at the beginning of line-wrapped diagnostics.
>  
> -fdisable-hsa-gridification
> -Common Report Var(flag_disable_hsa_gridification)
> -Disable HSA gridification for OMP pragmas
> -
>  ; Required for these enum values.
>  SourceInclude
>  pretty-print.h
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 4fc7d88..b9fb1e1 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -11171,6 +11171,17 @@ dynamic, guided, auto, runtime).  The default is static.
>  Maximum depth of recursion when querying properties of SSA names in things
>  like fold routines.  One level of recursion corresponds to following a
>  use-def chain.
> +
> +@item omp-gpu-gridify
> +Enable creation of gridified GPU kernels out of loops within target
> +OpenMP constructs.  This conversion is enabled by default when
> +offloading to HSA; to disable it, use @option{--param omp-gpu-gridify=0}.
> +
> +@item hsa-gen-debug-stores
> +Enable emission of special debug stores within HSA kernels which are
> +then read and reported by the libgomp plugin.  Generation of these
> +stores is disabled by default; use @option{--param hsa-gen-debug-stores=1}
> +to enable it.
>  @end table
>  @end table
>  
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index 34aafc8..f90a698 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -82,6 +82,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "symbol-summary.h"
>  #include "hsa.h"
> +#include "params.h"
>  
>  /* Lowering of OMP parallel and workshare constructs proceeds in two
>     phases.  The first phase scans the function looking for OMP statements
> @@ -17449,7 +17450,8 @@ execute_lower_omp (void)
>  
>    body = gimple_body (current_function_decl);
>  
> -  if (hsa_gen_requested_p () && !flag_disable_hsa_gridification)
> +  if (hsa_gen_requested_p ()
> +      && PARAM_VALUE (PARAM_OMP_GPU_GRIDIFY) == 1)
>      create_target_gpukernels (&body);
>  
>    scan_omp (&body, NULL);
> diff --git a/gcc/params.def b/gcc/params.def
> index 86911e2..f12755b 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1178,6 +1178,12 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
>  	  " SSA name.",
>  	  2, 1, 0)
>  
> +DEFPARAM (PARAM_OMP_GPU_GRIDIFY,
> +	  "omp-gpu-gridify",
> +	  "Enable creation of gridified GPU kernels out of OpenMP target "
> +	  "constructs",
> +	  1, 0, 1)
> +
>  DEFPARAM (PARAM_HSA_GEN_DEBUG_STORES,
>  	  "hsa-gen-debug-stores",
>  	  "Level of hsa debug stores verbosity",
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 9/12] Small alloc-pool fix
  2015-11-10  8:48         ` Martin Liška
@ 2015-11-10 10:07           ` Richard Biener
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Biener @ 2015-11-10 10:07 UTC (permalink / raw)
  To: Martin Liška; +Cc: GCC Patches

On Tue, Nov 10, 2015 at 9:47 AM, Martin Liška <mliska@suse.cz> wrote:
> On 11/06/2015 10:57 AM, Richard Biener wrote:
>> On Fri, 6 Nov 2015, Martin Liška wrote:
>>
>>> On 11/06/2015 10:00 AM, Richard Biener wrote:
>>>> On Thu, 5 Nov 2015, Martin Jambor wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> we use C++ new operators based on alloc-pools a lot in the subsequent
>>>>> patches and realized that on the current trunk, such new operators
>>>>> would needlessly call the placement ::new operator within the allocate
>>>>> method of the pool allocator.  This is fixed below by providing a new
>>>>> allocation method which does not call placement new and which is
>>>>> therefore only safe to use from within a new operator.
>>>>>
>>>>> The patch also fixes the slightly weird two-parameter operator new
>>>>> (which we do not use in the HSA backend) so that it does not do the
>>>>> same.
>>>>
>>>
>>> Hi.
>>>
>>>> Why do you need to add the pointer variant then?
>>>
>>> You are right, we originally used the variant in the branch, but it
>>> eventually ended up unused.
>>>
>>>>
>>>> Also isn't the issue with allocate() that it does
>>>>
>>>>     return ::new (m_allocator.allocate ()) T ();
>>>>
>>>> which 1) value-initializes and 2) doesn't even work with types like
>>>>
>>>> struct T { T(int); };
>>>>
>>>> thus types without a default constructor.
>>>
>>> You are right, it produces compilation error.
>>>
>>>>
>>>> I think the allocator was poorly C++-ified without updating the
>>>> specification for the cases it is supposed to handle.  And now
>>>> we have C++ uses that are not working because the allocator is
>>>> broken.
>>>>
>>>> An incrementally better version (w/o fixing the issue with
>>>> types w/o default constructor) is
>>>>
>>>>     return ::new (m_allocator.allocate ()) T;
>>>
>>> I've tried that, and it also calls the default ctor:
>>>
>>> ../../gcc/alloc-pool.h: In instantiation of ‘T* object_allocator<T>::allocate() [with T = et_occ]’:
>>> ../../gcc/alloc-pool.h:531:22:   required from ‘void* operator new(size_t, object_allocator<T>&) [with T = et_occ; size_t = long unsigned int]’
>>> ../../gcc/et-forest.c:449:46:   required from here
>>> ../../gcc/et-forest.c:58:3: error: ‘et_occ::et_occ()’ is private
>>>    et_occ ();
>>>    ^
>>> In file included from ../../gcc/et-forest.c:28:0:
>>> ../../gcc/alloc-pool.h:483:44: error: within this context
>>>      return ::new (m_allocator.allocate ()) T;
>>
>> Yes, but it does slightly cheaper initialization of PODs
>>
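>> To make the POD distinction concrete, a minimal sketch (the type pod
>> is hypothetical, not something from the patch):
>>
>>   #include <cstdlib>
>>   #include <new>
>>
>>   struct pod { int x; };  /* POD: no user-provided constructor */
>>
>>   int main ()
>>   {
>>     void *p = std::malloc (sizeof (pod));
>>     void *q = std::malloc (sizeof (pod));
>>     pod *a = ::new (p) pod;    /* default-init: a->x is indeterminate */
>>     pod *b = ::new (q) pod (); /* value-init: b->x is zero-initialized */
>>     int r = b->x;              /* 0; reading a->x would be undefined */
>>     std::free (q);
>>     std::free (p);
>>     return r;
>>   }
>>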
>>>
>>>>
>>>> thus default-initialize which does no initialization for PODs (without
>>>> array members...) which is what the old pool allocator did.
>>>
>>> I'm not so familiar with differences related to PODs.
>>>
>>>>
>>>> To fix the new operator (how do you even call that?  does it allow
>>>> specifying constructor args and thus work without a default constructor?)
>>>> it should indeed use an allocation method not performing the placement
>>>> new.  But I'd call it allocate_raw rather than vallocate.
>>>
>>> For situations where we do not have a default ctor, one should use the
>>> helper method defined at the end of alloc-pool.h:
>>>
>>> template <typename T>
>>> inline void *
>>> operator new (size_t, object_allocator<T> &a)
>>> {
>>>   return a.allocate ();
>>> }
>>>
>>> For instance:
>>> et_occ *nw = new (et_occurrences) et_occ (2);
>>
>> Oh, so it uses placement new syntax...  works for me.
>>
>>> or as used in the HSA branch:
>>>
>>> /* New operator to allocate convert instruction from pool alloc.  */
>>>
>>> void *
>>> hsa_insn_cvt::operator new (size_t)
>>> {
>>>   return hsa_allocp_inst_cvt->allocate_raw ();
>>> }
>>>
>>> and
>>>
>>> cvtinsn = new hsa_insn_cvt (reg, *ptmp2);
>>>
>>>
>>> I attached patch where I rename the method as suggested.
>>
>> Ok.
>
> Hi.
>
> I'm sending suggested patch that survives regression tests and bootstrap
> on x86_64-linux-gnu.
>
> Can I install the patch to trunk?

Ok.

Thanks,
Richard.

> Thanks,
> Martin
>
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Martin
>>>
>>>>
>>>> Thanks.
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> 2015-11-05  Martin Liska  <mliska@suse.cz>
>>>>>        Martin Jambor  <mjambor@suse.cz>
>>>>>
>>>>>    * alloc-pool.h (object_allocator::vallocate): New method.
>>>>>    (operator new): Call vallocate instead of allocate.
>>>>>    (operator new): New operator.
>>>>>
>>>>>
>>>>> diff --git a/gcc/alloc-pool.h b/gcc/alloc-pool.h
>>>>> index 0dc05cd..46b6550 100644
>>>>> --- a/gcc/alloc-pool.h
>>>>> +++ b/gcc/alloc-pool.h
>>>>> @@ -483,6 +483,12 @@ public:
>>>>>      return ::new (m_allocator.allocate ()) T ();
>>>>>    }
>>>>>
>>>>> +  inline void *
>>>>> +  vallocate () ATTRIBUTE_MALLOC
>>>>> +  {
>>>>> +    return m_allocator.allocate ();
>>>>> +  }
>>>>> +
>>>>>    inline void
>>>>>    remove (T *object)
>>>>>    {
>>>>> @@ -523,12 +529,19 @@ struct alloc_pool_descriptor
>>>>>  };
>>>>>
>>>>>  /* Helper for classes that do not provide default ctor.  */
>>>>> -
>>>>>  template <typename T>
>>>>>  inline void *
>>>>>  operator new (size_t, object_allocator<T> &a)
>>>>>  {
>>>>> -  return a.allocate ();
>>>>> +  return a.vallocate ();
>>>>> +}
>>>>> +
>>>>> +/* Helper for classes that do not provide default ctor.  */
>>>>> +template <typename T>
>>>>> +inline void *
>>>>> +operator new (size_t, object_allocator<T> *a)
>>>>> +{
>>>>> +  return a->vallocate ();
>>>>>  }
>>>>>
>>>>>  /* Hashtable mapping alloc_pool names to descriptors.  */
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 7/12] Disabling the vectorizer for GPU kernels/functions
  2015-11-06  8:38   ` Richard Biener
@ 2015-11-10 14:48     ` Martin Jambor
  2015-11-10 14:59       ` Richard Biener
  0 siblings, 1 reply; 44+ messages in thread
From: Martin Jambor @ 2015-11-10 14:48 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Martin Liska

On Fri, Nov 06, 2015 at 09:38:21AM +0100, Richard Biener wrote:
> On Thu, 5 Nov 2015, Martin Jambor wrote:
> 
> > Hi,
> > 
> > in the previous email I wrote we need to "change behavior" of a few
> > optimization passes.  One was the flattening of GPU functions and the
> > other two are in the patch below.  It all comes down to the fact
> > that, at the moment, we need to switch off the vectorizer (only for
> > the GPU functions, of course).
> > 
> > We are actually quite close to being able to handle gimple vector
> > input in the HSA back-end but not all the way yet, and before
> > allowing the vectorizer again, we will have to make sure it never
> > produces vectors bigger than 128 bits (in GPU functions).
> 
> Hmm.  I'd rather have this modify
> DECL_FUNCTION_SPECIFIC_OPTIMIZATION of the hsa function to get this
> effect.  I think I mentioned this to the OACC guys as well for a
> similar needs of them.

I see, that is a good idea.  I have reverted changes to
tree-ssa-loop.c and tree-vectorizer.c and on top of that committed the
following patch to the branch which makes modifications to HSA fndecls
at a more convenient spot and disables vectorization in the following
way:

  tree gdecl = gpu->decl;
  tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
  if (fn_opts == NULL_TREE)
    fn_opts = optimization_default_node;
  fn_opts = copy_node (fn_opts);
  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
  DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;

I hope that is what you meant.  I have also verified that it works.

Thanks,

Martin


2015-11-10  Martin Jambor  <mjambor@suse.cz>

	* hsa.h (hsa_summary_t): Add a comment to method link_functions.
	(hsa_summary_t::link_functions): Moved...
	* hsa.c (hsa_summary_t::link_functions): ...here.  Added common fndecl
	modifications.
	Include stringpool.h.
	* ipa-hsa.c (process_hsa_functions): Do not add flatten attribute
	here.  Fixed comments.

diff --git a/gcc/hsa.c b/gcc/hsa.c
index ab05a1d..e63be95 100644
--- a/gcc/hsa.c
+++ b/gcc/hsa.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "alloc-pool.h"
 #include "cgraph.h"
 #include "print-tree.h"
+#include "stringpool.h"
 #include "symbol-summary.h"
 #include "hsa.h"
 
@@ -693,6 +694,40 @@ hsa_get_declaration_name (tree decl)
   return NULL;
 }
 
+/* Couple GPU and HOST as gpu-specific and host-specific implementation of the
+   same function.  KIND determines whether GPU is a host-invokable kernel or
+   gpu-callable function.  */
+
+inline void
+hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
+			       hsa_function_kind kind)
+{
+  hsa_function_summary *gpu_summary = get (gpu);
+  hsa_function_summary *host_summary = get (host);
+
+  gpu_summary->m_kind = kind;
+  host_summary->m_kind = kind;
+
+  gpu_summary->m_gpu_implementation_p = true;
+  host_summary->m_gpu_implementation_p = false;
+
+  gpu_summary->m_binded_function = host;
+  host_summary->m_binded_function = gpu;
+
+  tree gdecl = gpu->decl;
+  DECL_ATTRIBUTES (gdecl)
+    = tree_cons (get_identifier ("flatten"), NULL_TREE,
+		 DECL_ATTRIBUTES (gdecl));
+
+  tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
+  if (fn_opts == NULL_TREE)
+    fn_opts = optimization_default_node;
+  fn_opts = copy_node (fn_opts);
+  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
+  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
+  DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
+}
+
 /* Add a HOST function to HSA summaries.  */
 
 void
diff --git a/gcc/hsa.h b/gcc/hsa.h
index 025de67..b6855ea 100644
--- a/gcc/hsa.h
+++ b/gcc/hsa.h
@@ -1161,27 +1161,14 @@ public:
   hsa_summary_t (symbol_table *table):
     function_summary<hsa_function_summary *> (table) { }
 
+  /* Couple GPU and HOST as gpu-specific and host-specific implementation of
+     the same function.  KIND determines whether GPU is a host-invokable kernel
+     or gpu-callable function.  */
+
   void link_functions (cgraph_node *gpu, cgraph_node *host,
 		       hsa_function_kind kind);
 };
 
-inline void
-hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
-			       hsa_function_kind kind)
-{
-  hsa_function_summary *gpu_summary = get (gpu);
-  hsa_function_summary *host_summary = get (host);
-
-  gpu_summary->m_kind = kind;
-  host_summary->m_kind = kind;
-
-  gpu_summary->m_gpu_implementation_p = true;
-  host_summary->m_gpu_implementation_p = false;
-
-  gpu_summary->m_binded_function = host;
-  host_summary->m_binded_function = gpu;
-}
-
 /* in hsa.c */
 extern struct hsa_function_representation *hsa_cfun;
 extern hash_map <tree, vec <const char *> *> *hsa_decl_kernel_dependencies;
diff --git a/gcc/ipa-hsa.c b/gcc/ipa-hsa.c
index b4cb58e..d77fa6b 100644
--- a/gcc/ipa-hsa.c
+++ b/gcc/ipa-hsa.c
@@ -90,16 +90,12 @@ process_hsa_functions (void)
 	  cgraph_node *clone = node->create_virtual_clone
 	    (vec <cgraph_edge *> (), NULL, NULL, "hsa");
 	  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
-	  if (s->m_kind == HSA_KERNEL)
-	    DECL_ATTRIBUTES (clone->decl)
-	      = tree_cons (get_identifier ("flatten"), NULL_TREE,
-			   DECL_ATTRIBUTES (clone->decl));
 
 	  clone->force_output = true;
 	  hsa_summaries->link_functions (clone, node, s->m_kind);
 
 	  if (dump_file)
-	    fprintf (dump_file, "HSA creates a new clone: %s, type: %s\n",
+	    fprintf (dump_file, "Created a new HSA clone: %s, type: %s\n",
 		     clone->name (),
 		     s->m_kind == HSA_KERNEL ? "kernel" : "function");
 	}
@@ -116,7 +112,7 @@ process_hsa_functions (void)
 	  hsa_summaries->link_functions (clone, node, HSA_FUNCTION);
 
 	  if (dump_file)
-	    fprintf (dump_file, "HSA creates a new function clone: %s\n",
+	    fprintf (dump_file, "Created a new HSA function clone: %s\n",
 		     clone->name ());
 	}
     }

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 7/12] Disabling the vectorizer for GPU kernels/functions
  2015-11-10 14:48     ` Martin Jambor
@ 2015-11-10 14:59       ` Richard Biener
  0 siblings, 0 replies; 44+ messages in thread
From: Richard Biener @ 2015-11-10 14:59 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Martin Liska

On Tue, 10 Nov 2015, Martin Jambor wrote:

> On Fri, Nov 06, 2015 at 09:38:21AM +0100, Richard Biener wrote:
> > On Thu, 5 Nov 2015, Martin Jambor wrote:
> > 
> > > Hi,
> > > 
> > > in the previous email I wrote we need to "change behavior" of a few
> > > optimization passes.  One was the flattening of GPU functions and the
> > > other two are in the patch below.  It all comes down to the fact
> > > that, at the moment, we need to switch off the vectorizer (only for
> > > the GPU functions, of course).
> > > 
> > > We are actually quite close to being able to handle gimple vector
> > > input in the HSA back-end but not all the way yet, and before
> > > allowing the vectorizer again, we will have to make sure it never
> > > produces vectors bigger than 128 bits (in GPU functions).
> > 
> > Hmm.  I'd rather have this modify
> > DECL_FUNCTION_SPECIFIC_OPTIMIZATION of the hsa function to get this
> > effect.  I think I mentioned this to the OACC guys as well for a
> > similar needs of them.
> 
> I see, that is a good idea.  I have reverted changes to
> tree-ssa-loop.c and tree-vectorizer.c and on top of that committed the
> following patch to the branch which makes modifications to HSA fndecls
> at a more convenient spot and disables vectorization in the following
> way:
> 
>   tree gdecl = gpu->decl;
>   tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
>   if (fn_opts == NULL_TREE)
>     fn_opts = optimization_default_node;
>   fn_opts = copy_node (fn_opts);
>   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
>   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
>   DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
> 
> I hope that is what you meant.  I have also verified that it works.

Yes, that's what I meant.

Thanks,
Richard.

> Thanks,
> 
> Martin
> 
> 
> 2015-11-10  Martin Jambor  <mjambor@suse.cz>
> 
> 	* hsa.h (hsa_summary_t): Add a comment to method link_functions.
> 	(hsa_summary_t::link_functions): Moved...
> 	* hsa.c (hsa_summary_t::link_functions): ...here.  Added common fndecl
> 	modifications.
> 	Include stringpool.h.
> 	* ipa-hsa.c (process_hsa_functions): Do not add flatten attribute
> 	here.  Fixed comments.
> 
> diff --git a/gcc/hsa.c b/gcc/hsa.c
> index ab05a1d..e63be95 100644
> --- a/gcc/hsa.c
> +++ b/gcc/hsa.c
> @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "alloc-pool.h"
>  #include "cgraph.h"
>  #include "print-tree.h"
> +#include "stringpool.h"
>  #include "symbol-summary.h"
>  #include "hsa.h"
>  
> @@ -693,6 +694,40 @@ hsa_get_declaration_name (tree decl)
>    return NULL;
>  }
>  
> +/* Couple GPU and HOST as gpu-specific and host-specific implementation of the
> +   same function.  KIND determines whether GPU is a host-invokable kernel or
> +   gpu-callable function.  */
> +
> +inline void
> +hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
> +			       hsa_function_kind kind)
> +{
> +  hsa_function_summary *gpu_summary = get (gpu);
> +  hsa_function_summary *host_summary = get (host);
> +
> +  gpu_summary->m_kind = kind;
> +  host_summary->m_kind = kind;
> +
> +  gpu_summary->m_gpu_implementation_p = true;
> +  host_summary->m_gpu_implementation_p = false;
> +
> +  gpu_summary->m_binded_function = host;
> +  host_summary->m_binded_function = gpu;
> +
> +  tree gdecl = gpu->decl;
> +  DECL_ATTRIBUTES (gdecl)
> +    = tree_cons (get_identifier ("flatten"), NULL_TREE,
> +		 DECL_ATTRIBUTES (gdecl));
> +
> +  tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
> +  if (fn_opts == NULL_TREE)
> +    fn_opts = optimization_default_node;
> +  fn_opts = copy_node (fn_opts);
> +  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
> +  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
> +  DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
> +}
> +
>  /* Add a HOST function to HSA summaries.  */
>  
>  void
> diff --git a/gcc/hsa.h b/gcc/hsa.h
> index 025de67..b6855ea 100644
> --- a/gcc/hsa.h
> +++ b/gcc/hsa.h
> @@ -1161,27 +1161,14 @@ public:
>    hsa_summary_t (symbol_table *table):
>      function_summary<hsa_function_summary *> (table) { }
>  
> +  /* Couple GPU and HOST as gpu-specific and host-specific implementation of
> +     the same function.  KIND determines whether GPU is a host-invokable kernel
> +     or gpu-callable function.  */
> +
>    void link_functions (cgraph_node *gpu, cgraph_node *host,
>  		       hsa_function_kind kind);
>  };
>  
> -inline void
> -hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
> -			       hsa_function_kind kind)
> -{
> -  hsa_function_summary *gpu_summary = get (gpu);
> -  hsa_function_summary *host_summary = get (host);
> -
> -  gpu_summary->m_kind = kind;
> -  host_summary->m_kind = kind;
> -
> -  gpu_summary->m_gpu_implementation_p = true;
> -  host_summary->m_gpu_implementation_p = false;
> -
> -  gpu_summary->m_binded_function = host;
> -  host_summary->m_binded_function = gpu;
> -}
> -
>  /* in hsa.c */
>  extern struct hsa_function_representation *hsa_cfun;
>  extern hash_map <tree, vec <const char *> *> *hsa_decl_kernel_dependencies;
> diff --git a/gcc/ipa-hsa.c b/gcc/ipa-hsa.c
> index b4cb58e..d77fa6b 100644
> --- a/gcc/ipa-hsa.c
> +++ b/gcc/ipa-hsa.c
> @@ -90,16 +90,12 @@ process_hsa_functions (void)
>  	  cgraph_node *clone = node->create_virtual_clone
>  	    (vec <cgraph_edge *> (), NULL, NULL, "hsa");
>  	  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
> -	  if (s->m_kind == HSA_KERNEL)
> -	    DECL_ATTRIBUTES (clone->decl)
> -	      = tree_cons (get_identifier ("flatten"), NULL_TREE,
> -			   DECL_ATTRIBUTES (clone->decl));
>  
>  	  clone->force_output = true;
>  	  hsa_summaries->link_functions (clone, node, s->m_kind);
>  
>  	  if (dump_file)
> -	    fprintf (dump_file, "HSA creates a new clone: %s, type: %s\n",
> +	    fprintf (dump_file, "Created a new HSA clone: %s, type: %s\n",
>  		     clone->name (),
>  		     s->m_kind == HSA_KERNEL ? "kernel" : "function");
>  	}
> @@ -116,7 +112,7 @@ process_hsa_functions (void)
>  	  hsa_summaries->link_functions (clone, node, HSA_FUNCTION);
>  
>  	  if (dump_file)
> -	    fprintf (dump_file, "HSA creates a new function clone: %s\n",
> +	    fprintf (dump_file, "Created a new HSA function clone: %s\n",
>  		     clone->name ());
>  	}
>      }
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 2/12] Modifications to libgomp proper
  2015-11-05 21:54 ` [hsa 2/12] Modifications to libgomp proper Martin Jambor
@ 2015-11-12 10:11   ` Jakub Jelinek
  2015-11-12 13:22     ` Thomas Schwinge
  0 siblings, 1 reply; 44+ messages in thread
From: Jakub Jelinek @ 2015-11-12 10:11 UTC (permalink / raw)
  To: GCC Patches

On Thu, Nov 05, 2015 at 10:54:42PM +0100, Martin Jambor wrote:
> The patch below contains all changes to libgomp files.  First, it adds
> a new constant identifying HSA devices and a structure that is shared
> between libgomp and the compiler when kernels are launched from other
> kernels via dynamic parallelism.
> 
> Second, it modifies the GOMP_target_41 function so that it can also take
> kernel attributes (essentially the grid dimension) as a parameter and
> pass them on to the HSA libgomp plugin.  Because we do want HSAIL
> generation to fail gracefully and use host fallback in that case, the
> same function calls the host implementation if it cannot map the
> requested function to an accelerated one or if a new callback,
> can_run_func, indicates there is a problem.
> 
> We need a new hook because we use it to check for linking errors, which
> we cannot do when incrementally loading registered images.  And we
> want to handle linking errors, so that when we cannot emit HSAIL for a
> function called from a kernel (possibly in a different compilation
> unit), we also resort to host fallback.
> 
> Last but not least, the patch removes data remapping when the selected
> device is capable of sharing memory with the host.

The patch clearly is not against current trunk; there is no GOMP_target_41
function, the GOMP_target_ext function has extra arguments, etc.

> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 9c8b1fb..0ad42d2 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -876,7 +876,8 @@ struct gomp_device_descr
>    void *(*dev2host_func) (int, void *, const void *, size_t);
>    void *(*host2dev_func) (int, void *, const void *, size_t);
>    void *(*dev2dev_func) (int, void *, const void *, size_t);
> -  void (*run_func) (int, void *, void *);
> +  void (*run_func) (int, void *, void *, const void *);

Adding arguments to existing plugin methods is a plugin ABI incompatible
change.  We now have:
  DLSYM (version);
  if (device->version_func () != GOMP_VERSION)
    {
      err = "plugin version mismatch";
      goto fail;
    }
so there is a way to deal with it, but you need to adjust all plugins.
See below anyway.

> --- a/libgomp/oacc-host.c
> +++ b/libgomp/oacc-host.c
> @@ -123,7 +123,8 @@ host_host2dev (int n __attribute__ ((unused)),
>  }
>  
>  static void
> -host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars)
> +host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
> +	  const void* kern_launch __attribute__ ((unused)))

This is C, space before * not after it.
>  {
>    void (*fn)(void *) = (void (*)(void *)) fn_ptr;

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -1248,7 +1248,12 @@ gomp_get_target_fn_addr (struct gomp_device_descr *devicep,
>        splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
>        gomp_mutex_unlock (&devicep->lock);
>        if (tgt_fn == NULL)
> -	gomp_fatal ("Target function wasn't mapped");
> +	{
> +	  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> +	    return NULL;
> +	  else
> +	    gomp_fatal ("Target function wasn't mapped");
> +	}
>  
>        return (void *) tgt_fn->tgt_offset;
>      }
> @@ -1276,6 +1281,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
>      return gomp_target_fallback (fn, hostaddrs);
>  
>    void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
> +  assert (fn_addr);

I must say I really don't like putting asserts into libgomp; in production
it is after all not built with -DNDEBUG.  But this shows a worse problem:
if you have GCC 5-compiled OpenMP code, of course there won't be an HSA
offloaded copy, but if you try to run it on a box with HSA offloading
enabled, you can run into this assertion failure.
Supposedly the old APIs (GOMP_target, GOMP_target_update, GOMP_target_data)
should treat GOMP_OFFLOAD_CAP_SHARED_MEM capable devices as unconditional
device fallback?

> @@ -1297,7 +1304,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
>  void
>  GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>  		void **hostaddrs, size_t *sizes, unsigned short *kinds,
> -		unsigned int flags, void **depend)
> +		unsigned int flags, void **depend, const void *kernel_launch)

GOMP_target_ext has different arguments; you get the num_teams and
thread_limit clause values in there already (if known at compile time
or before entering the target region; 0 stands for an
implementation-defined choice, -1 for unknown before GOMP_target_ext).
Plus I must say I really don't like the addition of an HSA-specific
argument to the API, it is unclean and really doesn't scale; when
somebody adds support for another offloading target, would we add yet
another argument?  We can't use the same one, because one could have
configured both HSA and that other kind of offloading at the same
time, and which one is picked would be only a runtime decision, based
on env vars, omp_set_default_device, etc.
num_teams/thread_limit, as runtime arguments, you already get on the trunk.
For compile-time-decided values, those should go into some data
section and be somehow attached to what fn is translated into in the
AVL tree (which you really don't need to use for variables on
GOMP_OFFLOAD_CAP_SHARED_MEM obviously, but can still use for the
kernels, and populate during registration of the offloading region).

>  {
>    struct gomp_device_descr *devicep = resolve_device (device);
>  
> @@ -1312,8 +1319,16 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>  	gomp_task_maybe_wait_for_dependencies (depend);
>      }
>  
> +  void *fn_addr = NULL;
> +  bool host_fallback = false;
>    if (devicep == NULL
> -      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> +      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
> +      || !(fn_addr = gomp_get_target_fn_addr (devicep, fn))
> +      || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
> +    host_fallback = true;
> +
> +  if (host_fallback
> +      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)

The part below is now in a different function.

>      {
>        size_t i, tgt_align = 0, tgt_size = 0;
>        char *tgt = NULL;
> @@ -1343,15 +1358,20 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>  		tgt_size = tgt_size + sizes[i];
>  	      }
>  	}
> -      gomp_target_fallback (fn, hostaddrs);
> -      return;
> -    }
>  

	Jakub

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 4/12] OpenMP lowering/expansion changes (gridification)
  2015-11-05 21:57 ` [hsa 4/12] OpenMP lowering/expansion changes (gridification) Martin Jambor
  2015-11-09 10:02   ` Martin Jambor
@ 2015-11-12 11:16   ` Jakub Jelinek
  1 sibling, 0 replies; 44+ messages in thread
From: Jakub Jelinek @ 2015-11-12 11:16 UTC (permalink / raw)
  To: GCC Patches

On Thu, Nov 05, 2015 at 10:57:33PM +0100, Martin Jambor wrote:
> the patch in this email contains the changes to make our OpenMP
> lowering and expansion machinery produce GPU kernels for a certain
> limited class of loops.  The plan is to make that class quite a big
> bigger, but only the following is ready for submission now.
> 
> Basically, whenever the compiler configured for HSAIL generation
> encounters the following pattern:
> 
>   #pragma omp target
>   #pragma omp teams thread_limit(workgroup_size) // thread_limit is optional
>   #pragma omp distribute parallel for firstprivate(n) private(i) other_sharing_clauses()
>     for (i = 0; i < n; i++)
>       some_loop_body

Do you support only lb 0 or any constant?  Only step 1?  Can the
ub be constant, or just a variable?  If you need the number of iterations
computed before GOMP_target_ext, supposedly you also need to check that
n can't change in between target and the distribute (e.g. if it is
addressable or global var) and there are some statements in between.

What about schedule or dist_schedule clauses?  Only schedule(auto) or
missing schedule guarantees you can distribute the work among the
threads any way the compiler wants.
dist_schedule is always static, but could have different chunk_size.

The current int num_teams, int thread_limit GOMP_target_ext arguments
perhaps could be changed to something like int num_args, long *args,
where args[0] would be the current num_teams and args[1] the current
thread_limit, and if there is any offloading target that might benefit
from knowing the number of iterations of a distribute parallel for that
is the only important statement inside, you could perhaps pass it as
args[2] and pass 3 instead of 2 to num_args.  That could be something
kind of generic rather than HSA-specific, and extensible.  But, looking
at your kernel_launch structure, you want something like multiple
dimensions and to compute each dimension separately rather than combine
(collapse) all
dimensions together, which is what OpenMP expansion does right now.
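
Spelled out, the num_args/args shape above would be roughly the
following (a sketch only, not the actual trunk declaration; the slot
assignments are just the ones described above):

  #include <stddef.h>

  void
  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
		   void **hostaddrs, size_t *sizes, unsigned short *kinds,
		   unsigned int flags, void **depend, int num_args,
		   long *args);
  /* args[0] = num_teams, args[1] = thread_limit, and optionally
     args[2] = the number of iterations of the distribute parallel for.  */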

> While we have also been experimenting quite a bit with dynamic
> parallelism, we have only been able to achieve any good performance
> via this process of gridification.  The user can be notified whether a
> particular target construct was gridified or not via our process of
> dumping notes, which however only appear in the detailed dump.  I am
> seriously considering emitting some kind of warning when an HSA-enabled
> compiler is about to produce non-gridified target code.

But then it would warn pretty much on all of the libgomp testsuite with target
constructs in them...

> @@ -547,13 +548,13 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,

> --- a/gcc/fortran/types.def
> +++ b/gcc/fortran/types.def
> @@ -145,6 +145,7 @@ DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I2_INT, BT_VOID, BT_VOLATILE_PTR, BT_I2, BT
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I4_INT, BT_VOID, BT_VOLATILE_PTR, BT_I4, BT_INT)
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I8_INT, BT_VOID, BT_VOLATILE_PTR, BT_I8, BT_INT)
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I16_INT, BT_VOID, BT_VOLATILE_PTR, BT_I16, BT_INT)
> +DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_INT_PTR, BT_VOID, BT_PTR, BT_INT, BT_PTR)
>  
>  DEF_FUNCTION_TYPE_4 (BT_FN_VOID_OMPFN_PTR_UINT_UINT,
>                       BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, BT_UINT)
> @@ -215,9 +216,9 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,
>  DEF_FUNCTION_TYPE_8 (BT_FN_VOID_OMPFN_PTR_UINT_LONG_LONG_LONG_LONG_UINT,
>  		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT,
>  		     BT_LONG, BT_LONG, BT_LONG, BT_LONG, BT_UINT)
> -DEF_FUNCTION_TYPE_8 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
> +DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
>  		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
> -		     BT_PTR, BT_PTR, BT_UINT, BT_PTR)
> +		     BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)

You'd need to move it if you add arguments (but as I said on the other
patch, this won't really apply on top of the trunk anyway).

> --- a/gcc/gimple.h
> +++ b/gcc/gimple.h
> @@ -153,6 +153,7 @@ enum gf_mask {
>      GF_OMP_FOR_KIND_TASKLOOP	= 2,
>      GF_OMP_FOR_KIND_CILKFOR     = 3,
>      GF_OMP_FOR_KIND_OACC_LOOP	= 4,
> +    GF_OMP_FOR_KIND_KERNEL_BODY = 5,
>      /* Flag for SIMD variants of OMP_FOR kinds.  */
>      GF_OMP_FOR_SIMD		= 1 << 3,
>      GF_OMP_FOR_KIND_SIMD	= GF_OMP_FOR_SIMD | 0,
> @@ -621,8 +622,24 @@ struct GTY((tag("GSS_OMP_FOR")))
>    /* [ WORD 11 ]
>       Pre-body evaluated before the loop body begins.  */
>    gimple_seq pre_body;
> +
> +  /* [ WORD 12 ]
> +     If set, this statement is part of a gridified kernel, its clauses need to
> +     be scanned and lowered but the statement should be discarded after
> +     lowering.  */
> +  bool kernel_phony;

A bool flag is better put as a GF_OMP_* flag; there are still bits left
there.

> @@ -642,6 +659,26 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT")))
>    /* [ WORD 10 ]
>       Shared data argument.  */
>    tree data_arg;
> +
> +  /* TODO: Revisit placement of the following two fields.  On one hand, we
> +     currently only use them on target construct.  On the other, use on
> +     parallel construct is also possible in the future.  */
> +
> +  /* [ WORD 11 ] */
> +  /* Number of elements in kernel_iter array.  */
> +  size_t dimensions;
> +
> +  /* [ WORD 12 ] */
> +  /* If target also contains a GPU kernel, it should be run with the
> +     following grid sizes.  */
> +  struct gimple_omp_target_grid_dim
> +    * GTY((length ("%h.dimensions"))) kernel_dim;
> +
> +  /* [ WORD 13 ] */
> +  /* If set, this statement is part of a gridified kernel, its clauses need to
> +     be scanned and lowered but the statement should be discarded after
> +     lowering.  */
> +  bool kernel_phony;

I really don't like sticking any other arguments into these gimple
structures.  Add some artificial clause and add it to the construct's
clauses instead?

> --- a/gcc/omp-builtins.def
> +++ b/gcc/omp-builtins.def
> @@ -302,8 +302,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
>  		  BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
>  		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
> +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register",
> +		  BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST)
> +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER, "GOMP_offload_unregister",
> +		  BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST)

These two are deprecated, use the *_ver ones instead.

>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET, "GOMP_target_41",
> -		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
> +		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
>  		  ATTR_NOTHROW_LIST)

This won't really apply to trunk.

	Jakub

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-09 16:59     ` Martin Jambor
  2015-11-10  9:01       ` Richard Biener
@ 2015-11-12 11:19       ` Jakub Jelinek
  2015-11-13 13:01         ` Martin Jambor
  1 sibling, 1 reply; 44+ messages in thread
From: Jakub Jelinek @ 2015-11-12 11:19 UTC (permalink / raw)
  To: Richard Biener, GCC Patches

On Mon, Nov 09, 2015 at 05:58:56PM +0100, Martin Jambor wrote:
> > But I don't see any way to disable it on the command line?  (no switch?)
> 
> No, the switch is -foffload, which has missing documentation (PR
> 67300) and is only described at https://gcc.gnu.org/wiki/Offloading
> Nevertheless, the option allows the user to specify compiler option
> -foffload=disable and no offloading should happen, not even HSA.  The
> user can also enumerate just the offload targets they want (and pass
> them special command line stuff).
> 
> It seems I have misplaced a hunk in the patch series.  Nevertheless,
> in the first patch (with configuration stuff), there is a change to
> opts.c which scans the -foffload= contents and sets the flag variable
> if hsa is not present.
> 
> Whenever the compiler has to decide whether HSA is enabled for the
> given compilation or not, it has to look at this variable (if
> configured for HSA).

But what is the difference between
-foffload=disable
or
-foffload={list not including hsa}
and the new param?  If you don't gridify, you don't emit any kernels...

	Jakub

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 2/12] Modifications to libgomp proper
  2015-11-12 10:11   ` Jakub Jelinek
@ 2015-11-12 13:22     ` Thomas Schwinge
  2015-11-12 14:11       ` Nathan Sidwell
  2015-11-12 15:59       ` Jakub Jelinek
  0 siblings, 2 replies; 44+ messages in thread
From: Thomas Schwinge @ 2015-11-12 13:22 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell; +Cc: GCC Patches

Hi!

On Thu, 12 Nov 2015 11:11:33 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Nov 05, 2015 at 10:54:42PM +0100, Martin Jambor wrote:
> > --- a/libgomp/libgomp.h
> > +++ b/libgomp/libgomp.h
> > @@ -876,7 +876,8 @@ struct gomp_device_descr
> >    void *(*dev2host_func) (int, void *, const void *, size_t);
> >    void *(*host2dev_func) (int, void *, const void *, size_t);
> >    void *(*dev2dev_func) (int, void *, const void *, size_t);
> > -  void (*run_func) (int, void *, void *);
> > +  void (*run_func) (int, void *, void *, const void *);
> 
> Adding arguments to existing plugin methods is a plugin ABI incompatible
> change.  We now have:
>   DLSYM (version);
>   if (device->version_func () != GOMP_VERSION)
>     {
>       err = "plugin version mismatch";
>       goto fail;
>     }
> so there is a way to deal with it, but you need to adjust all plugins.

I'm confused -- didn't we agree that we don't need to maintain backwards
compatibility in the libgomp <-> plugins interface?  (Nathan?)  As far as
I remember, the argument was that libgomp and all its plugins will always
be built from the same source tree, so will be compatible with each
other, "by definition"?

(We do need, and have, versioning between GCC proper and libgomp
interfaces.)


> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> > @@ -1248,7 +1248,12 @@ gomp_get_target_fn_addr (struct gomp_device_descr *devicep,
> >        splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
> >        gomp_mutex_unlock (&devicep->lock);
> >        if (tgt_fn == NULL)
> > -	gomp_fatal ("Target function wasn't mapped");
> > +	{
> > +	  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> > +	    return NULL;
> > +	  else
> > +	    gomp_fatal ("Target function wasn't mapped");
> > +	}
> >  
> >        return (void *) tgt_fn->tgt_offset;
> >      }
> > @@ -1276,6 +1281,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
> >      return gomp_target_fallback (fn, hostaddrs);
> >  
> >    void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
> > +  assert (fn_addr);
> 
> I must say I really don't like putting asserts into libgomp; in production
> it is after all not built with -DNDEBUG.

I like them, because they help during development, and for getting
higher-quality bug reports from users, and they serve as source code
documentation.  Of course, I understand your -- I suppose -- performance
worries.  Does such a NULL-checking assert -- hopefully marked as
"unlikely" -- cause any noticeable overhead, though?


> But this shows a worse problem:
> if you have GCC 5-compiled OpenMP code, of course there won't be an HSA
> offloaded copy, but if you try to run it on a box with HSA offloading
> enabled, you can run into this assertion failure.

That's one of the issues that I'm working on resolving with my
"Forwarding -foffload=[...] from the driver (compile-time) to libgomp
(run-time)" patch,
<http://news.gmane.org/find-root.php?message_id=%3C87mvve95af.fsf%40schwinge.name%3E>.
In such a case (no GOMP_offload_register_ver call for HSA), HSA
offloading would not be considered (not "enabled") in libgomp.  (It'll be
two more weeks before I can make progress with that patch; will be
attending SuperComputing 2015 next week -- anyone else will be there,
too?)

> Supposedly the old APIs (GOMP_target, GOMP_target_update, GOMP_target_data)
> should treat GOMP_OFFLOAD_CAP_SHARED_MEM capable devices as unconditional
> device fallback?


> > @@ -1297,7 +1304,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
> >  void
> >  GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
> >  		void **hostaddrs, size_t *sizes, unsigned short *kinds,
> > -		unsigned int flags, void **depend)
> > +		unsigned int flags, void **depend, const void *kernel_launch)
> 
> GOMP_target_ext has different arguments; you get the num_teams and
> thread_limit clause values in there already (if known at compile time
> or before entering the target region; 0 stands for an
> implementation-defined choice, -1 for unknown before GOMP_target_ext).
> Plus I must say I really don't like the addition of an HSA-specific
> argument to the API, it is unclean and really doesn't scale; when
> somebody adds support for another offloading target, would we add yet
> another argument?  We can't use the same one, because one could have
> configured both HSA and that other kind of offloading at the same
> time, and which one is picked would be only a runtime decision, based
> on env vars, omp_set_default_device, etc.
> num_teams/thread_limit, as runtime arguments, you already get on the trunk.
> For compile-time-decided values, those should go into some data
> section and be somehow attached to what fn is translated into in the
> AVL tree (which you really don't need to use for variables on
> GOMP_OFFLOAD_CAP_SHARED_MEM obviously, but can still use for the
> kernels, and populate during registration of the offloading region).

What about adopting the "tagging" scheme that we added for
libgomp/oacc-parallel.c:GOACC_parallel_keyed?  With support for other
offloading schemes being added one by one, isn't it quite likely that the
interface will need to be adjusted for each of them, because
more/different data will have to be transmitted from GCC proper to
libgomp?


Grüße
 Thomas


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 2/12] Modifications to libgomp proper
  2015-11-12 13:22     ` Thomas Schwinge
@ 2015-11-12 14:11       ` Nathan Sidwell
  2015-11-12 15:59       ` Jakub Jelinek
  1 sibling, 0 replies; 44+ messages in thread
From: Nathan Sidwell @ 2015-11-12 14:11 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: GCC Patches

On 11/12/15 08:21, Thomas Schwinge wrote:
> Hi!
>

>> so there is a way to deal with it, but you need to adjust all plugins.
>
> I'm confused -- didn't we agree that we don't need to maintain backwards
> compatibility in the libgomp <-> plugins interface?  (Nathan?)

Indeed, no need to deal with version skew between libgomp and its plugins.

On 07/24/15 12:30, Jakub Jelinek wrote:
> And I'd say that we don't really need to maintain support for mixing libgomp
> from one GCC version and libgomp plugins from another version, worst case
> there should be some GOMP_OFFLOAD_get_version function that libgomp could
> use to verify it is talking to the right version of the plugin and
> completely ignore it if it gives wrong version.


> (We do need, and have, versioning between GCC proper and libgomp
> interfaces.)

Yes. (For avoidance of doubt)

nathan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 2/12] Modifications to libgomp proper
  2015-11-12 13:22     ` Thomas Schwinge
  2015-11-12 14:11       ` Nathan Sidwell
@ 2015-11-12 15:59       ` Jakub Jelinek
  1 sibling, 0 replies; 44+ messages in thread
From: Jakub Jelinek @ 2015-11-12 15:59 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Nathan Sidwell, GCC Patches

On Thu, Nov 12, 2015 at 02:21:56PM +0100, Thomas Schwinge wrote:
> > > --- a/libgomp/libgomp.h
> > > +++ b/libgomp/libgomp.h
> > > @@ -876,7 +876,8 @@ struct gomp_device_descr
> > >    void *(*dev2host_func) (int, void *, const void *, size_t);
> > >    void *(*host2dev_func) (int, void *, const void *, size_t);
> > >    void *(*dev2dev_func) (int, void *, const void *, size_t);
> > > -  void (*run_func) (int, void *, void *);
> > > +  void (*run_func) (int, void *, void *, const void *);
> > 
> > Adding arguments to existing plugin methods is a plugin ABI incompatible
> > change.  We now have:
> >   DLSYM (version);
> >   if (device->version_func () != GOMP_VERSION)
> >     {
> >       err = "plugin version mismatch";
> >       goto fail;
> >     }
> > so there is a way to deal with it, but you need to adjust all plugins.
> 
> I'm confused -- didn't we agree that we don't need to maintain backwards
> compatibility in the libgomp <-> plugins interface?  (Nathan?)  As far as
> I remember, the argument was that libgomp and all its plugins will always
> be built from the same source tree, so will be compatible with each
> other, "by definition"?
> 
> (We do need, and have, versioning between GCC proper and libgomp
> interfaces.)

I've mentioned the GOMP_VERSION check in there, and that all the plugins
would need to be modified, which I think was not shown in the patch.

> > But this shows a worse problem:
> > if you have GCC 5-compiled OpenMP code, of course there won't be an HSA
> > offloaded copy, but if you try to run it on a box with HSA offloading
> > enabled, you can run into this assertion failure.
> 
> That's one of the issues that I'm working on resolving with my
> "Forwarding -foffload=[...] from the driver (compile-time) to libgomp
> (run-time)" patch,
> <http://news.gmane.org/find-root.php?message_id=%3C87mvve95af.fsf%40schwinge.name%3E>.
> In such a case (no GOMP_offload_register_ver call for HSA), HSA
> offloading would not be considered (not "enabled") in libgomp.  (It'll be
> two more weeks before I can make progress with that patch; will be
> attending SuperComputing 2015 next week -- will anyone else be there,
> too?)

You are aware of my objections to that and what it does to later dlopened
libraries.

> > GOMP_target_ext has different arguments; you get the num_teams and
> > thread_limit clause values in there already (if known at compile time
> > or before entering the target region; 0 stands for an
> > implementation-defined choice, -1 for unknown before GOMP_target_ext).
> > Plus I must say I really don't like the addition of an HSA-specific
> > argument to the API, it is unclean and really doesn't scale; when
> > somebody adds support for another offloading target, would we add yet
> > another argument?  We can't use the same one, because one could have
> > configured both HSA and that other kind of offloading at the same
> > time, and which one is picked would be only a runtime decision, based
> > on env vars, omp_set_default_device, etc.
> > num_teams/thread_limit, as runtime arguments, you already get on the trunk.
> > For compile-time-decided values, those should go into some data
> > section and be somehow attached to what fn is translated into in the
> > AVL tree (which you really don't need to use for variables on
> > GOMP_OFFLOAD_CAP_SHARED_MEM obviously, but can still use for the
> > kernels, and populate during registration of the offloading region).
> 
> What about adopting the "tagging" scheme that we added for
> libgomp/oacc-parallel.c:GOACC_parallel_keyed?  With support for other
> offloading schemes being added one by one, isn't it quite likely that the
> interface will need to be adjusted for each of them, because
> more/different data will have to be transmitted from GCC proper to
> libgomp?

Perhaps something similar, but certainly not as varargs; that is a really
bad idea.  Instead of the long * array I've talked about, perhaps use
void **, put in unkeyed num_teams and thread_limit as the first two
arguments thereof, then add keyed arguments in there, and terminate by 0.
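
On the caller side that could look roughly like this (a sketch only;
the key constant and the shift encoding are made up for illustration):

  #include <stdint.h>

  #define ARG_ID_SHIFT 8      /* hypothetical: id in the low bits */
  #define ARG_HSA_GRID_DIM 1  /* hypothetical key */

  static void
  fill_target_args (void **args, int num_teams, int thread_limit,
		    long grid_dim)
  {
    args[0] = (void *) (uintptr_t) num_teams;     /* unkeyed slot 0 */
    args[1] = (void *) (uintptr_t) thread_limit;  /* unkeyed slot 1 */
    args[2] = (void *) (((uintptr_t) grid_dim << ARG_ID_SHIFT)
			| ARG_HSA_GRID_DIM);      /* keyed entry */
    args[3] = NULL;                               /* 0 terminator */
  }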

	Jakub

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [hsa 5/12] New HSA-related GCC options
  2015-11-12 11:19       ` Jakub Jelinek
@ 2015-11-13 13:01         ` Martin Jambor
  0 siblings, 0 replies; 44+ messages in thread
From: Martin Jambor @ 2015-11-13 13:01 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, GCC Patches

On Thu, Nov 12, 2015 at 12:19:50PM +0100, Jakub Jelinek wrote:
> On Mon, Nov 09, 2015 at 05:58:56PM +0100, Martin Jambor wrote:
> > > But I don't see any way to disable it on the command line?  (no switch?)
> > 
> > No, the switch is -foffload, which has missing documentation (PR
> > 67300) and is only described at https://gcc.gnu.org/wiki/Offloading
> > Nevertheless, the option allows the user to specify compiler option
> > -foffload=disable and no offloading should happen, not even HSA.  The
> > user can also enumerate just the offload targets they want (and pass
> > them special command line stuff).
> > 
> > It seems I have misplaced a hunk in the patch series.  Nevertheless,
> > in the first patch (with configuration stuff), there is a change to
> > opts.c which scans the -foffload= contents and sets the flag variable
> > if hsa is not present.
> > 
> > Whenever the compiler has to decide whether HSA is enabled for the
> > given compilation or not, it has to look at this variable (if
> > configured for HSA).
> 
> But what is the difference between
> -foffload=disable
> or
> -foffload={list not including hsa}
> and the new param?  If you don't gridify, you don't emit any kernels...
> 

We do.  When a kernel cannot be gridified, we try to handle it via
dynamic parallelism (i.e. launching a kernel from a kernel).  Even
though we have not been able to get any good performance with it and
there are several limitations and open problems, I still include this
option in my plans because it is unlikely we will be able to handle
complex scenarios without it (and I hope that as HSA evolves, it will
become a viable, even though most probably always a bit slower,
option).

Apart from the performance degradation, the biggest problem is that
currently HSA dynamic parallelism does not allow you to wait for the
completion of the child kernel in a straightforward way.  There is a
hack that allowed us to do it, but by its nature it only allows depth
three dispatch (i.e. kernel->kernel->kernel).  We are limiting
ourselves to depth two, at the moment.

Dynamic parallelism also requires non-trivial preparation on the CPU
side, and quite a few HSA characteristics of the to-be-dispatched
kernels have to be known and passed to the GPU when the first kernel
is invoked.  In our current scheme, we have to know the "dependencies"
of each kernel at compile-time, which is sometimes not possible, for
example if the second kernel is invoked from a function that is in a
different compilation unit than the first kernel.

As I said, I hope that with time we will be able to overcome all of
this, but at the moment, dynamic parallelism is clearly just an
experimental feature (that is why I suggested warning when not
gridifying).

I hope this answers the question and explains the situation a bit,

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2015-11-13 13:01 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-05 21:51 Merge of HSA branch Martin Jambor
2015-11-05 21:53 ` [hsa 1/12] Configuration and offloading-related changes Martin Jambor
2015-11-05 22:47   ` Joseph Myers
2015-11-09 16:57     ` Martin Jambor
2015-11-05 21:54 ` [hsa 2/12] Modifications to libgomp proper Martin Jambor
2015-11-12 10:11   ` Jakub Jelinek
2015-11-12 13:22     ` Thomas Schwinge
2015-11-12 14:11       ` Nathan Sidwell
2015-11-12 15:59       ` Jakub Jelinek
2015-11-05 21:56 ` [hsa 3/12] HSA libgomp plugin Martin Jambor
2015-11-05 22:47   ` Joseph Myers
2015-11-09 16:58     ` Martin Jambor
2015-11-05 21:57 ` [hsa 4/12] OpenMP lowering/expansion changes (gridification) Martin Jambor
2015-11-09 10:02   ` Martin Jambor
2015-11-12 11:16   ` Jakub Jelinek
2015-11-05 21:58 ` [hsa 5/12] New HSA-related GCC options Martin Jambor
2015-11-05 22:48   ` Joseph Myers
2015-11-06  8:42   ` Richard Biener
2015-11-09 16:59     ` Martin Jambor
2015-11-10  9:01       ` Richard Biener
2015-11-12 11:19       ` Jakub Jelinek
2015-11-13 13:01         ` Martin Jambor
2015-11-05 21:59 ` [hsa 6/12] IPA-HSA pass Martin Jambor
2015-11-05 22:01 ` [hsa 7/12] Disabling the vectorizer for GPU kernels/functions Martin Jambor
2015-11-06  8:38   ` Richard Biener
2015-11-10 14:48     ` Martin Jambor
2015-11-10 14:59       ` Richard Biener
2015-11-05 22:02 ` [hsa 8/12] Pass manager changes Martin Jambor
2015-11-05 22:03 ` [hsa 9/12] Small alloc-pool fix Martin Jambor
2015-11-06  9:00   ` Richard Biener
2015-11-06  9:52     ` Martin Liška
2015-11-06  9:57       ` Richard Biener
2015-11-10  8:48         ` Martin Liška
2015-11-10 10:07           ` Richard Biener
2015-11-05 22:05 ` [hsa 10/12] HSAIL BRIG description header file (hopefully not a licensing issue) Martin Jambor
2015-11-06 11:29   ` Bernd Schmidt
2015-11-06 12:45     ` Bernd Schmidt
2015-11-05 22:06 ` [hsa 11/12] Majority of the HSA back-end Martin Jambor
2015-11-05 22:07 ` [hsa 12/12] HSA register allocator Martin Jambor
2015-11-06 10:13 ` Merge of HSA branch Bernd Schmidt
2015-11-06 10:30   ` Richard Biener
2015-11-06 11:03     ` Bernd Schmidt
2015-11-06 11:33       ` Thomas Schwinge
2015-11-06 10:54   ` Martin Liška
