public inbox for gcc-patches@gcc.gnu.org
* [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
@ 2018-09-05 11:49 ` ams
  2018-09-20 13:09   ` Richard Biener
                     ` (2 more replies)
  2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
                   ` (24 subsequent siblings)
  25 siblings, 3 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 659 bytes --]


At present, pointers passed to builtin functions, including atomic operators,
are stripped of their address space properties.  This doesn't appear to be
deliberate; get_builtin_sync_mem simply omits to copy them.

Not only that, but it forces pointer sizes to Pmode, which isn't appropriate
for all address spaces.

This patch attempts to correct both issues.  It works for GCN atomics and
GCN OpenACC gang-private variables.
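
A hedged illustration (__lds is a made-up named address space; any
target-defined address space behaves the same way): the MEM generated for
the atomic below now keeps the pointer's address space and uses that
space's natural pointer width instead of Pmode.

  extern int __lds *p;

  int
  get (void)
  {
    return __atomic_load_n (p, __ATOMIC_ACQUIRE);
  }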

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>

	gcc/
	* builtins.c (get_builtin_sync_mem): Handle address spaces.
---
 gcc/builtins.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)


[-- Attachment #2: 0002-Propagate-address-spaces-to-builtins.patch --]
[-- Type: text/x-patch; name="0002-Propagate-address-spaces-to-builtins.patch", Size: 1118 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 58ea747..361361c 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -5781,14 +5781,21 @@ static rtx
 get_builtin_sync_mem (tree loc, machine_mode mode)
 {
   rtx addr, mem;
+  int addr_space = TYPE_ADDR_SPACE (POINTER_TYPE_P (TREE_TYPE (loc))
+				    ? TREE_TYPE (TREE_TYPE (loc))
+				    : TREE_TYPE (loc));
+  scalar_int_mode addr_mode = targetm.addr_space.address_mode (addr_space);
 
-  addr = expand_expr (loc, NULL_RTX, ptr_mode, EXPAND_SUM);
-  addr = convert_memory_address (Pmode, addr);
+  addr = expand_expr (loc, NULL_RTX, addr_mode, EXPAND_SUM);
 
   /* Note that we explicitly do not want any alias information for this
      memory, so that we kill all other live memories.  Otherwise we don't
      satisfy the full barrier semantics of the intrinsic.  */
-  mem = validize_mem (gen_rtx_MEM (mode, addr));
+  mem = gen_rtx_MEM (mode, addr);
+
+  set_mem_addr_space (mem, addr_space);
+
+  mem = validize_mem (mem);
 
   /* The alignment needs to be at least according to that of the mode.  */
   set_mem_align (mem, MAX (GET_MODE_ALIGNMENT (mode),


* [PATCH 05/25] Add sorry_at diagnostic function.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (2 preceding siblings ...)
  2018-09-05 11:49 ` [PATCH 01/25] Handle vectors that don't fit in an integer ams
@ 2018-09-05 11:49 ` ams
  2018-09-05 13:39   ` David Malcolm
  2018-09-05 11:50 ` [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode ams
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 644 bytes --]


The plain "sorry" diagnostic only gives the "current" location, which is
typically the last line of the function or translation unit by the time we
get to the back end.

GCN uses "sorry" to report unsupported language features, such as static
constructors, so it's useful to have a "sorry_at" variant.

This patch implements "sorry_at" according to the pattern of the other "at"
variants.
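
A usage sketch (the surrounding back-end context and names are assumed):

  /* Report an unsupported feature at the declaration itself rather
     than at whatever input_location happens to be.  */
  if (DECL_STATIC_CONSTRUCTOR (fndecl))
    sorry_at (DECL_SOURCE_LOCATION (fndecl),
	      "static constructors are not supported on this target");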

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* diagnostic-core.h (sorry_at): New prototype.
	* diagnostic.c (sorry_at): New function.
---
 gcc/diagnostic-core.h |  1 +
 gcc/diagnostic.c      | 11 +++++++++++
 2 files changed, 12 insertions(+)


[-- Attachment #2: 0005-Add-sorry_at-diagnostic-function.patch --]
[-- Type: text/x-patch; name="0005-Add-sorry_at-diagnostic-function.patch", Size: 1323 bytes --]

diff --git a/gcc/diagnostic-core.h b/gcc/diagnostic-core.h
index e4ebe00..80ff395 100644
--- a/gcc/diagnostic-core.h
+++ b/gcc/diagnostic-core.h
@@ -96,6 +96,7 @@ extern bool permerror (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern bool permerror (rich_location *, const char *,
 				   ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void sorry (const char *, ...) ATTRIBUTE_GCC_DIAG(1,2);
+extern void sorry_at (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform (rich_location *, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform_n (location_t, unsigned HOST_WIDE_INT, const char *,
diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
index aae0934..56a1140 100644
--- a/gcc/diagnostic.c
+++ b/gcc/diagnostic.c
@@ -1443,6 +1443,17 @@ sorry (const char *gmsgid, ...)
   va_end (ap);
 }
 
+/* Same as above, but use location LOC instead of input_location.  */
+void
+sorry_at (location_t loc, const char *gmsgid, ...)
+{
+  va_list ap;
+  va_start (ap, gmsgid);
+  rich_location richloc (line_table, loc);
+  diagnostic_impl (&richloc, -1, gmsgid, &ap, DK_SORRY);
+  va_end (ap);
+}
+
 /* Return true if an error or a "sorry" has been seen.  Various
    processing is disabled after errors.  */
 bool


* [PATCH 01/25] Handle vectors that don't fit in an integer.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
  2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
@ 2018-09-05 11:49 ` ams
  2018-09-05 11:54   ` Jakub Jelinek
  2018-09-14 16:03   ` Richard Sandiford
  2018-09-05 11:49 ` [PATCH 05/25] Add sorry_at diagnostic function ams
                   ` (22 subsequent siblings)
  25 siblings, 2 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]


GCN vector sizes range between 64 and 512 bytes, none of which have
correspondingly sized integer modes.  This breaks a number of assumptions
throughout the compiler, but I don't really want to create modes just for this
purpose.

Instead, this patch fixes up the cases that I've found, so far, such that the
compiler tries something else, or fails to optimize, rather than just ICE.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
            Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	gcc/
	* combine.c (gen_lowpart_or_truncate): Return a clobber if there is
	not an integer mode of the same size as x.
	(gen_lowpart_for_combine): Fail if there is no integer mode of the
	same size.
	* expr.c (expand_expr_real_1): Force the first operand into memory
	if it is a vector register and the result is in BLKmode.
	* tree-vect-stmts.c (vectorizable_store): Don't ICE when
	int_mode_for_size fails.
	(vectorizable_load): Likewise.
---
 gcc/combine.c         | 13 ++++++++++++-
 gcc/expr.c            |  8 ++++++++
 gcc/tree-vect-stmts.c |  8 ++++----
 3 files changed, 24 insertions(+), 5 deletions(-)


[-- Attachment #2: 0001-Handle-vectors-that-don-t-fit-in-an-integer.patch --]
[-- Type: text/x-patch; name="0001-Handle-vectors-that-don-t-fit-in-an-integer.patch", Size: 3506 bytes --]

diff --git a/gcc/combine.c b/gcc/combine.c
index a2649b6..cbf9dae 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -8621,7 +8621,13 @@ gen_lowpart_or_truncate (machine_mode mode, rtx x)
     {
       /* Bit-cast X into an integer mode.  */
       if (!SCALAR_INT_MODE_P (GET_MODE (x)))
-	x = gen_lowpart (int_mode_for_mode (GET_MODE (x)).require (), x);
+	{
+	  machine_mode imode
+	    = int_mode_for_mode (GET_MODE (x)).else_blk ();
+	  if (imode == BLKmode)
+	    return gen_rtx_CLOBBER (mode, const0_rtx);
+	  x = gen_lowpart (imode, x);
+	}
       x = simplify_gen_unary (TRUNCATE, int_mode_for_mode (mode).require (),
 			      x, GET_MODE (x));
     }
@@ -11698,6 +11704,11 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
   if (omode == imode)
     return x;
 
+  /* This can happen when there is no integer mode corresponding
+     to the size of a vector mode.  */
+  if (omode == BLKmode)
+    goto fail;
+
   /* We can only support MODE being wider than a word if X is a
      constant integer or has a mode the same size.  */
   if (maybe_gt (GET_MODE_SIZE (omode), UNITS_PER_WORD)
diff --git a/gcc/expr.c b/gcc/expr.c
index cd5cf12..776254a 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -10569,6 +10569,14 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
 			  || maybe_gt (bitpos + bitsize,
 				       GET_MODE_BITSIZE (mode2)));
 
+	/* If the result is in BLKmode and the underlying object is a
+	   vector in a register, and the size of the vector is larger than
+	   the largest integer mode, then we must force OP0 to be in memory
+	   as this is assumed in later code.  */
+	if (REG_P (op0) && VECTOR_MODE_P (mode2) && mode == BLKmode
+	    && maybe_gt (bitsize, MAX_FIXED_MODE_SIZE))
+	  must_force_mem = 1;
+
 	/* Handle CONCAT first.  */
 	if (GET_CODE (op0) == CONCAT && !must_force_mem)
 	  {
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 8d94fca..607a2bd 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -6702,12 +6702,12 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		     supported.  */
 		  unsigned lsize
 		    = group_size * GET_MODE_BITSIZE (elmode);
-		  elmode = int_mode_for_size (lsize, 0).require ();
 		  unsigned int lnunits = const_nunits / group_size;
 		  /* If we can't construct such a vector fall back to
 		     element extracts from the original vector type and
 		     element size stores.  */
-		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
+		  if (int_mode_for_size (lsize, 0).exists (&elmode)
+		      && mode_for_vector (elmode, lnunits).exists (&vmode)
 		      && VECTOR_MODE_P (vmode)
 		      && targetm.vector_mode_supported_p (vmode)
 		      && (convert_optab_handler (vec_extract_optab,
@@ -7839,11 +7839,11 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		     to a larger load.  */
 		  unsigned lsize
 		    = group_size * TYPE_PRECISION (TREE_TYPE (vectype));
-		  elmode = int_mode_for_size (lsize, 0).require ();
 		  unsigned int lnunits = const_nunits / group_size;
 		  /* If we can't construct such a vector fall back to
 		     element loads of the original vector type.  */
-		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
+		  if (int_mode_for_size (lsize, 0).exists (&elmode)
+		      && mode_for_vector (elmode, lnunits).exists (&vmode)
 		      && VECTOR_MODE_P (vmode)
 		      && targetm.vector_mode_supported_p (vmode)
 		      && (convert_optab_handler (vec_init_optab, vmode, elmode)


* [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
@ 2018-09-05 11:49 ` ams
  2018-09-05 12:21   ` Joseph Myers
                     ` (2 more replies)
  2018-09-05 11:49 ` [PATCH 01/25] Handle vectors that don't fit in an integer ams
                   ` (23 subsequent siblings)
  25 siblings, 3 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]


GCN has some registers which are special purpose, but not "fixed" because we
want the register allocator to track their usage and select alternatives that
use different special registers (e.g. scalar cc vs. vector cc).

Sometimes this leads the regrename pass to ICE.  Quite how it gets confused is
not well understood, but considering such registers for renaming is surely not
useful.

This patch creates a new macro, SPECIAL_REGNO_P, which excludes such registers
from regrename.  In other words, a special register is fixed once allocated.
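
A sketch of the intended target usage (the register names are invented
stand-ins for GCN's special scalar/vector condition registers):

  /* In the target's <cpu>.h: exclude special registers from renaming.  */
  #define SPECIAL_REGNO_P(REGNO) \
    ((REGNO) == SCC_REGNUM || (REGNO) == VCC_REGNUM)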

2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>

	gcc/
	* defaults.h (SPECIAL_REGNO_P): Define to false by default.
	* regrename.c (check_new_reg_p): Do not rename to a special register.
	(rename_chains): Do not rename special registers.
---
 gcc/defaults.h  | 4 ++++
 gcc/regrename.c | 2 ++
 2 files changed, 6 insertions(+)


[-- Attachment #2: 0004-SPECIAL_REGNO_P.patch --]
[-- Type: text/x-patch; name="0004-SPECIAL_REGNO_P.patch", Size: 1212 bytes --]

diff --git a/gcc/defaults.h b/gcc/defaults.h
index 9035b33..40ecf61 100644
--- a/gcc/defaults.h
+++ b/gcc/defaults.h
@@ -1198,6 +1198,10 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define NO_FUNCTION_CSE false
 #endif
 
+#ifndef SPECIAL_REGNO_P
+#define SPECIAL_REGNO_P(REGNO) false
+#endif
+
 #ifndef HARD_REGNO_RENAME_OK
 #define HARD_REGNO_RENAME_OK(FROM, TO) true
 #endif
diff --git a/gcc/regrename.c b/gcc/regrename.c
index 8424093..92e403e 100644
--- a/gcc/regrename.c
+++ b/gcc/regrename.c
@@ -320,6 +320,7 @@ check_new_reg_p (int reg ATTRIBUTE_UNUSED, int new_reg,
     if (TEST_HARD_REG_BIT (this_unavailable, new_reg + i)
 	|| fixed_regs[new_reg + i]
 	|| global_regs[new_reg + i]
+	|| SPECIAL_REGNO_P (new_reg + i)
 	/* Can't use regs which aren't saved by the prologue.  */
 	|| (! df_regs_ever_live_p (new_reg + i)
 	    && ! call_used_regs[new_reg + i])
@@ -480,6 +481,7 @@ rename_chains (void)
 	continue;
 
       if (fixed_regs[reg] || global_regs[reg]
+	  || SPECIAL_REGNO_P (reg)
 	  || (!HARD_FRAME_POINTER_IS_FRAME_POINTER && frame_pointer_needed
 	      && reg == HARD_FRAME_POINTER_REGNUM)
 	  || (HARD_FRAME_POINTER_IS_FRAME_POINTER && frame_pointer_needed


* [PATCH 00/25] AMD GCN Port
@ 2018-09-05 11:49 ams
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
                   ` (25 more replies)
  0 siblings, 26 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

Hi All,

This patch series contains the non-OpenACC/OpenMP portions of a port to
AMD GCN3 and GCN5 GPU processors.  It's sufficient to build
single-threaded programs, with vectorization in the usual way.  C and
Fortran are supported, C++ is not supported, and the other front-ends
have not been tested.  The OpenACC/OpenMP/libgomp portion will follow
eventually, once this is committed.

If the Steering Committee approve the port and the patches are accepted
then I'd like to see the port make it into GCC 9, please.

The patches, as they are, are not perfect; I still want to massage the
test results a little, but I'd like to find out about big review issues
sooner rather than later.

I've posted the middle-end patches first.  Some of these are target
independent issues, but are included in the series because they are
required for GCN to work properly.

I've then split the back-end patches into libgfortran, libgcc, and the
back-end proper.

Finally I have the testsuite tweaks and fix ups.  I don't have any
GCN-specific tests as yet; the existing tests serve to demonstrate
correctness, and I anticipate future GCN tests being largely
optimization issues, such as instruction selection and vectorization
coverage.

I'm aware that I still need to make the necessary documentation
adjustments.

Thanks in advance

-- 
Andrew Stubbs
Mentor Graphics / CodeSourcery


* [PATCH 09/25] Elide repeated RTL elements.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (6 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 06/25] Remove constant vec_select restriction ams
@ 2018-09-05 11:50 ` ams
  2018-09-11 22:46   ` Jeff Law
  2018-09-05 11:50 ` [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME ams
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 520 bytes --]


GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
far more bearable by eliding long sequences of the same element into "repeated"
messages.
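
For example, a vector of 64 identical zero elements, which previously
printed as 64 separate lines, would now come out along these lines
(exact spacing illustrative):

  (const_vector:V64SI [
          (const_int 0) repeated 64x
      ])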

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	gcc/
	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
	the same element is repeated rather than printing all of them.
---
 gcc/print-rtl.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


[-- Attachment #2: 0009-Elide-repeated-RTL-elements.patch --]
[-- Type: text/x-patch; name="0009-Elide-repeated-RTL-elements.patch", Size: 672 bytes --]

diff --git a/gcc/print-rtl.c b/gcc/print-rtl.c
index 5dd2e31..8a04264 100644
--- a/gcc/print-rtl.c
+++ b/gcc/print-rtl.c
@@ -370,7 +370,20 @@ rtx_writer::print_rtx_operand_codes_E_and_V (const_rtx in_rtx, int idx)
 	m_sawclose = 1;
 
       for (int j = 0; j < XVECLEN (in_rtx, idx); j++)
-	print_rtx (XVECEXP (in_rtx, idx, j));
+	{
+	  int j1;
+
+	  print_rtx (XVECEXP (in_rtx, idx, j));
+	  for (j1 = j + 1; j1 < XVECLEN (in_rtx, idx); j1++)
+	    if (XVECEXP (in_rtx, idx, j) != XVECEXP (in_rtx, idx, j1))
+	      break;
+
+	  if (j1 != j + 1)
+	    {
+	      fprintf (m_outfile, " repeated %ix", j1 - j);
+	      j = j1 - 1;
+	    }
+	}
 
       m_indent -= 2;
     }


* [PATCH 08/25] Fix co-array allocation
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (4 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode ams
@ 2018-09-05 11:50 ` ams
       [not found]   ` <7f5064c3-afc6-b7b5-cade-f03af5b86331@moene.org>
  2018-09-05 11:50 ` [PATCH 06/25] Remove constant vec_select restriction ams
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]


The Fortran front-end has a bug in which it uses "int" values for "size_t"
parameters.  I don't know why this isn't a problem for all 64-bit architectures,
but GCN ends up with the data in the wrong argument register and/or stack slot,
and bad things happen.

This patch corrects the issue by setting the correct type.
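
A hedged sketch of the mismatch (the libgfortran prototype is abbreviated):
the callee declares size_t, so an int-typed argument can end up in the
wrong argument register or stack slot.

  /* Callee side, in libgfortran (remaining parameters omitted).  */
  extern void _gfortran_caf_register (size_t size /* , ... */);

  /* Front-end side: build the first argument in the matching type.  */
  size = fold_convert (sizetype, integer_zero_node);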

2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>

	gcc/fortran/
	* trans-expr.c (gfc_trans_structure_assign): Ensure that
	integer_zero_node is of sizetype when used as the first
	argument of a call to _gfortran_caf_register.
	* trans-intrinsic.c (conv_intrinsic_event_query): Convert computed
	index to a size_t type.
	* trans-stmt.c (gfc_trans_event_post_wait): Likewise.
---
 gcc/fortran/trans-expr.c      | 2 +-
 gcc/fortran/trans-intrinsic.c | 3 ++-
 gcc/fortran/trans-stmt.c      | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)


[-- Attachment #2: 0008-Fix-co-array-allocation.patch --]
[-- Type: text/x-patch; name="0008-Fix-co-array-allocation.patch", Size: 1831 bytes --]

diff --git a/gcc/fortran/trans-expr.c b/gcc/fortran/trans-expr.c
index 56ce98c..91be3fb 100644
--- a/gcc/fortran/trans-expr.c
+++ b/gcc/fortran/trans-expr.c
@@ -7729,7 +7729,7 @@ gfc_trans_structure_assign (tree dest, gfc_expr * expr, bool init, bool coarray)
 		 suffices to recognize the data as array.  */
 	      if (rank < 0)
 		rank = 1;
-	      size = integer_zero_node;
+	      size = fold_convert (sizetype, integer_zero_node);
 	      desc = field;
 	      gfc_add_modify (&block, gfc_conv_descriptor_rank (desc),
 			      build_int_cst (signed_char_type_node, rank));
diff --git a/gcc/fortran/trans-intrinsic.c b/gcc/fortran/trans-intrinsic.c
index b2cea93..23c13da 100644
--- a/gcc/fortran/trans-intrinsic.c
+++ b/gcc/fortran/trans-intrinsic.c
@@ -10732,7 +10732,8 @@ conv_intrinsic_event_query (gfc_code *code)
 	      tmp = fold_build2_loc (input_location, MULT_EXPR,
 				     integer_type_node, extent, tmp);
 	      index = fold_build2_loc (input_location, PLUS_EXPR,
-				       integer_type_node, index, tmp);
+				       size_type_node, index,
+				       fold_convert (size_type_node, tmp));
 	      if (i < ar->dimen - 1)
 		{
 		  ubound = gfc_conv_descriptor_ubound_get (desc, gfc_rank_cst[i]);
diff --git a/gcc/fortran/trans-stmt.c b/gcc/fortran/trans-stmt.c
index 795d3cc..2c59675 100644
--- a/gcc/fortran/trans-stmt.c
+++ b/gcc/fortran/trans-stmt.c
@@ -1096,7 +1096,8 @@ gfc_trans_event_post_wait (gfc_code *code, gfc_exec_op op)
 	  tmp = fold_build2_loc (input_location, MULT_EXPR,
 				 integer_type_node, extent, tmp);
 	  index = fold_build2_loc (input_location, PLUS_EXPR,
-				   integer_type_node, index, tmp);
+				   size_type_node, index,
+				   fold_convert (size_type_node, tmp));
 	  if (i < ar->dimen - 1)
 	    {
 	      ubound = gfc_conv_descriptor_ubound_get (desc, gfc_rank_cst[i]);


* [PATCH 06/25] Remove constant vec_select restriction.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (5 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 08/25] Fix co-array allocation ams
@ 2018-09-05 11:50 ` ams
  2018-09-11 22:44   ` Jeff Law
  2018-09-05 11:50 ` [PATCH 09/25] Elide repeated RTL elements ams
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 794 bytes --]


The vec_select operator is documented to require a const_int for the lane
selector operand, but GCN has an instruction that can select the lane at
runtime, so it seems reasonable to remove this restriction.

This patch simply replaces assertions that the operand is constant with early
exits from the optimizers.  I think it's reasonable that vec_select with a
non-constant operand cannot be optimized yet.

Also included is the necessary documentation tweak.
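
For instance, GCN can now form RTL along these lines (modes and register
numbers illustrative), with the lane number in a register:

  (set (reg:SI 0)
       (vec_select:SI (reg:V64SI 1)
                      (parallel [(reg:SI 2)])))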

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* doc/rtl.texi: Adjust vec_select description.
	* simplify-rtx.c (simplify_binary_operation_1): Allow VEC_SELECT to use
	non-constant selectors.
---
 gcc/doc/rtl.texi   | 11 ++++++-----
 gcc/simplify-rtx.c |  9 +++++++--
 2 files changed, 13 insertions(+), 7 deletions(-)


[-- Attachment #2: 0006-Remove-constant-vec_select-restriction.patch --]
[-- Type: text/x-patch; name="0006-Remove-constant-vec_select-restriction.patch", Size: 2149 bytes --]

diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index 5b1e695..0695ad2 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2939,11 +2939,12 @@ a set bit indicates it is taken from @var{vec1}.
 @item (vec_select:@var{m} @var{vec1} @var{selection})
 This describes an operation that selects parts of a vector.  @var{vec1} is
 the source vector, and @var{selection} is a @code{parallel} that contains a
-@code{const_int} for each of the subparts of the result vector, giving the
-number of the source subpart that should be stored into it.
-The result mode @var{m} is either the submode for a single element of
-@var{vec1} (if only one subpart is selected), or another vector mode
-with that element submode (if multiple subparts are selected).
+@code{const_int} (or another expression, if the selection can be made at
+runtime) for each of the subparts of the result vector, giving the number of
+the source subpart that should be stored into it.  The result mode @var{m} is
+either the submode for a single element of @var{vec1} (if only one subpart is
+selected), or another vector mode with that element submode (if multiple
+subparts are selected).
 
 @findex vec_concat
 @item (vec_concat:@var{m} @var{x1} @var{x2})
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index a9f2586..b4c6883 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3604,7 +3604,10 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
 	  gcc_assert (mode == GET_MODE_INNER (GET_MODE (trueop0)));
 	  gcc_assert (GET_CODE (trueop1) == PARALLEL);
 	  gcc_assert (XVECLEN (trueop1, 0) == 1);
-	  gcc_assert (CONST_INT_P (XVECEXP (trueop1, 0, 0)));
+
+	  /* We can't reason about selections made at runtime.  */
+	  if (!CONST_INT_P (XVECEXP (trueop1, 0, 0)))
+	    return 0;
 
 	  if (vec_duplicate_p (trueop0, &elt0))
 	    return elt0;
@@ -3703,7 +3706,9 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
 		{
 		  rtx x = XVECEXP (trueop1, 0, i);
 
-		  gcc_assert (CONST_INT_P (x));
+		  if (!CONST_INT_P (x))
+		    return 0;
+
 		  RTVEC_ELT (v, i) = CONST_VECTOR_ELT (trueop0,
 						       INTVAL (x));
 		}


* [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (3 preceding siblings ...)
  2018-09-05 11:49 ` [PATCH 05/25] Add sorry_at diagnostic function ams
@ 2018-09-05 11:50 ` ams
  2018-09-17  8:46   ` Richard Sandiford
  2018-09-05 11:50 ` [PATCH 08/25] Fix co-array allocation ams
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 440 bytes --]


This is an update of the patch posted to PR82089 long ago.  We ran into the
same bug on GCN, so we need this fixed as part of this series.

The problem is that emit_cstore chooses between sign- and zero-extension by
checking whether STORE_FLAG_VALUE has the sign bit set in the result mode.  In
BImode the value 1 is the sign bit, so the store flag was sign-extended to -1;
with STORE_FLAG_VALUE == 1 the conversion must be unsigned.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
            Tom de Vries  <tom@codesourcery.com>

	PR82089

	gcc/
	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
	STORE_FLAG_VALUE == 1.
---
 gcc/expmed.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)


[-- Attachment #2: 0007-pr82089-Don-t-sign-extend-SFV-1-in-BImode.patch --]
[-- Type: text/x-patch; name="0007-pr82089-Don-t-sign-extend-SFV-1-in-BImode.patch", Size: 1089 bytes --]

diff --git a/gcc/expmed.c b/gcc/expmed.c
index 29ce10b..0b87fdc 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -5464,11 +5464,18 @@ emit_cstore (rtx target, enum insn_code icode, enum rtx_code code,
      If STORE_FLAG_VALUE does not have the sign bit set when
      interpreted in MODE, we can do this conversion as unsigned, which
      is usually more efficient.  */
-  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode))
+  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode)
+      || (result_mode == BImode && int_target_mode != BImode))
     {
-      convert_move (target, subtarget,
-		    val_signbit_known_clear_p (result_mode,
-					       STORE_FLAG_VALUE));
+      gcc_assert (GET_MODE_SIZE (result_mode) != 1
+		  || STORE_FLAG_VALUE == 1 || STORE_FLAG_VALUE == -1);
+      bool unsignedp
+	= (GET_MODE_SIZE (result_mode) == 1
+	   ? STORE_FLAG_VALUE == 1
+	   : val_signbit_known_clear_p (result_mode, STORE_FLAG_VALUE));
+
+      convert_move (target, subtarget, unsignedp);
+
       op0 = target;
       result_mode = int_target_mode;
     }


* [PATCH 10/25] Convert BImode vectors.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (8 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME ams
@ 2018-09-05 11:50 ` ams
  2018-09-05 11:56   ` Jakub Jelinek
                     ` (2 more replies)
  2018-09-05 11:50 ` [PATCH 12/25] Make default_static_chain return NULL in non-static functions ams
                   ` (15 subsequent siblings)
  25 siblings, 3 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]


GCN uses V64BImode to represent vector masks in the middle-end, and DImode
bit-masks to represent them in the back-end.  These must be converted at expand
time and the most convenient way is to simply use a SUBREG.

This works fine except that simplify_subreg needs to be able to convert
immediates, mostly for REG_EQUAL and REG_EQUIV, and currently does not know how
to convert vectors to integers where there is more than one element per byte.

This patch implements such conversions for the cases that we need.

I don't know why this is not a problem for other targets that use BImode
vectors, such as ARM SVE, so it's possible I missed some magic somewhere?
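
As a worked example of the packing implemented here: element i of the
Boolean vector maps to bit i of the integer, with the lowest element in
the first byte.  So a V64BImode CONST_VECTOR with elements 0 and 1 set
and the rest clear converts to the DImode CONST_INT 3, and converting
that CONST_INT back yields the same vector.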

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* simplify-rtx.c (convert_packed_vector): New function.
	(simplify_immed_subreg): Recognize Boolean vectors and call
	convert_packed_vector.
---
 gcc/simplify-rtx.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)


[-- Attachment #2: 0010-Convert-BImode-vectors.patch --]
[-- Type: text/x-patch; name="0010-Convert-BImode-vectors.patch", Size: 3261 bytes --]

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index b4c6883..89487f2 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5976,6 +5976,73 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
   return 0;
 }
 
+/* Convert a CONST_INT to a CONST_VECTOR, or vice versa.
+
+   This should only occur for VECTOR_BOOL_MODE types, so the semantics
+   specified by that are assumed.  In particular, the lowest value is
+   in the first byte.  */
+
+static rtx
+convert_packed_vector (fixed_size_mode to_mode, rtx op,
+		       machine_mode from_mode, unsigned int byte,
+		       unsigned int first_elem, unsigned int inner_bytes)
+{
+  /* Sizes greater than HOST_WIDE_INT would need a better implementation.  */
+  gcc_assert (GET_MODE_SIZE (to_mode) <= sizeof (HOST_WIDE_INT));
+
+  if (GET_CODE (op) == CONST_VECTOR)
+    {
+      gcc_assert (!VECTOR_MODE_P (to_mode));
+
+      int num_elem = GET_MODE_NUNITS (from_mode).to_constant ();
+      int elem_bitsize = (GET_MODE_SIZE (from_mode).to_constant ()
+			  * BITS_PER_UNIT) / num_elem;
+      int elem_mask = (1 << elem_bitsize) - 1;
+      HOST_WIDE_INT subreg_mask =
+	(sizeof (HOST_WIDE_INT) == GET_MODE_SIZE (to_mode)
+	 ? -1
+	 : (((HOST_WIDE_INT)1 << (GET_MODE_SIZE (to_mode) * BITS_PER_UNIT))
+	    - 1));
+
+      HOST_WIDE_INT val = 0;
+      for (int i = 0; i < num_elem; i++)
+	val |= ((INTVAL (CONST_VECTOR_ELT (op, i)) & elem_mask)
+		<< (i * elem_bitsize));
+
+      val >>= byte * BITS_PER_UNIT;
+      val &= subreg_mask;
+
+      return gen_rtx_CONST_INT (VOIDmode, val);
+    }
+  else if (GET_CODE (op) == CONST_INT)
+    {
+      /* Subregs of a vector not implemented yet.  */
+      gcc_assert (maybe_eq (GET_MODE_SIZE (to_mode),
+			    GET_MODE_SIZE (from_mode)));
+
+      gcc_assert (VECTOR_MODE_P (to_mode));
+
+      int num_elem = GET_MODE_NUNITS (to_mode);
+      int elem_bitsize = (GET_MODE_SIZE (to_mode) * BITS_PER_UNIT) / num_elem;
+      int elem_mask = (1 << elem_bitsize) - 1;
+
+      rtvec val = rtvec_alloc (num_elem);
+      rtx *elem = &RTVEC_ELT (val, 0);
+
+      for (int i = 0; i < num_elem; i++)
+	elem[i] = gen_rtx_CONST_INT (VOIDmode,
+				     (INTVAL (op) >> (i * elem_bitsize))
+				     & elem_mask);
+
+      return gen_rtx_CONST_VECTOR (to_mode, val);
+    }
+  else
+    {
+      gcc_unreachable ();
+      return op;
+    }
+}
+
 /* Evaluate a SUBREG of a CONST_INT or CONST_WIDE_INT or CONST_DOUBLE
    or CONST_FIXED or CONST_VECTOR, returning another CONST_INT or
    CONST_WIDE_INT or CONST_DOUBLE or CONST_FIXED or CONST_VECTOR.
@@ -6017,6 +6084,15 @@ simplify_immed_subreg (fixed_size_mode outermode, rtx op,
   if (COMPLEX_MODE_P (outermode))
     return NULL_RTX;
 
+  /* Vectors with multiple elements per byte are a special case.  */
+  if ((VECTOR_MODE_P (innermode)
+       && ((GET_MODE_NUNITS (innermode).to_constant ()
+	    / GET_MODE_SIZE (innermode).to_constant ()) > 1))
+      || (VECTOR_MODE_P (outermode)
+	  && (GET_MODE_NUNITS (outermode) / GET_MODE_SIZE (outermode) > 1)))
+    return convert_packed_vector (outermode, op, innermode, byte, first_elem,
+				  inner_bytes);
+
   /* We support any size mode.  */
   max_bitsize = MAX (GET_MODE_BITSIZE (outermode),
 		     inner_bytes * BITS_PER_UNIT);


* [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (7 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 09/25] Elide repeated RTL elements ams
@ 2018-09-05 11:50 ` ams
  2018-09-11 22:56   ` Jeff Law
  2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1119 bytes --]


The HSA GPU drivers can't cope with binaries that have the same symbol defined
multiple times, even though the names are not exported.  This happens whenever
there are file-scope static variables with matching names.  I believe it's also
an issue with switch tables.

This is a bug, but outside our control, so we must work around it when multiple
translation units have the same symbol defined.

Therefore, we've implemented name mangling via
TARGET_MANGLE_DECL_ASSEMBLER_NAME, but found some places where the middle-end
assumes that the decl name matches the name in the source.

This patch fixes up those cases by falling back to comparing the unmangled
name when a lookup fails.
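
A hedged illustration of the failing case (the mangled name "bar.3" is
invented): if TARGET_MANGLE_DECL_ASSEMBLER_NAME renames the static
function below to "bar.3", the alias target "bar" no longer matches any
assembler name, so the lookup now retries by DECL_NAME.

  static int bar (void) { return 0; }
  extern int foo (void) __attribute__ ((alias ("bar")));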

2018-09-05  Julian Brown  <julian@codesourcery.com>

	gcc/
	* cgraphunit.c (handle_alias_pairs): Scan for aliases by DECL_NAME if
	decl assembler name doesn't match.

	gcc/c-family/
	* c-pragma.c (maybe_apply_pending_pragma_weaks): Scan for aliases with
	DECL_NAME if decl assembler name doesn't match.
---
 gcc/c-family/c-pragma.c | 14 ++++++++++++++
 gcc/cgraphunit.c        | 15 +++++++++++++++
 2 files changed, 29 insertions(+)


[-- Attachment #2: 0003-Improve-TARGET_MANGLE_DECL_ASSEMBLER_NAME.patch --]
[-- Type: text/x-patch; name="0003-Improve-TARGET_MANGLE_DECL_ASSEMBLER_NAME.patch", Size: 1597 bytes --]

diff --git a/gcc/c-family/c-pragma.c b/gcc/c-family/c-pragma.c
index 84e4341..1c0be0c 100644
--- a/gcc/c-family/c-pragma.c
+++ b/gcc/c-family/c-pragma.c
@@ -323,6 +323,20 @@ maybe_apply_pending_pragma_weaks (void)
 	continue;
 
       target = symtab_node::get_for_asmname (id);
+
+      /* Try again if ID didn't match an assembler name by looking through
+	 decl names.  */
+      if (!target)
+	{
+	  symtab_node *node;
+	  FOR_EACH_SYMBOL (node)
+	    if (strcmp (IDENTIFIER_POINTER (id), node->name ()) == 0)
+	      {
+	        target = node;
+		break;
+	      }
+	}
+
       decl = build_decl (UNKNOWN_LOCATION,
 			 target ? TREE_CODE (target->decl) : FUNCTION_DECL,
 			 alias_id, default_function_type);
diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index ec490d7..fc3f34e 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -1393,6 +1393,21 @@ handle_alias_pairs (void)
     {
       symtab_node *target_node = symtab_node::get_for_asmname (p->target);
 
+      /* If the alias target didn't match a symbol's assembler name (e.g.
+	 because it has been mangled by TARGET_MANGLE_DECL_ASSEMBLER_NAME),
+	 try again with the unmangled decl name.  */
+      if (!target_node)
+	{
+	  symtab_node *node;
+	  FOR_EACH_SYMBOL (node)
+	    if (strcmp (IDENTIFIER_POINTER (p->target),
+			node->name ()) == 0)
+	      {
+		target_node = node;
+		break;
+	      }
+	}
+
       /* Weakrefs with target not defined in current unit are easy to handle:
 	 they behave just as external variables except we need to note the
 	 alias flag to later output the weakref pseudo op into asm file.  */


* [PATCH 12/25] Make default_static_chain return NULL in non-static functions
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (9 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
@ 2018-09-05 11:50 ` ams
  2018-09-17 18:55   ` Richard Sandiford
  2018-09-05 11:51 ` [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE ams
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 421 bytes --]


This patch allows default_static_chain to be called from the back-end without
it knowing whether the function actually needs a static chain.  Or, to put it
another way, without duplicating the check everywhere it's used.
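
A minimal sketch of the intended call site (back-end context assumed):
the hook can now be called unconditionally and simply yields NULL when
the function takes no static chain.

  rtx chain = targetm.calls.static_chain (cfun->decl, /*incoming_p=*/true);
  if (chain != NULL_RTX)
    {
      /* ... set up the incoming static chain register ... */
    }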

2018-09-05  Tom de Vries  <tom@codesourcery.com>

	gcc/
	* targhooks.c (default_static_chain): Return NULL in non-static
	functions.
---
 gcc/targhooks.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


[-- Attachment #2: 0012-Make-default_static_chain-return-NULL-in-non-static-.patch --]
[-- Type: text/x-patch; name="0012-Make-default_static_chain-return-NULL-in-non-static-.patch", Size: 690 bytes --]

diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index afd56f3..742cfbf 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1021,8 +1021,14 @@ default_internal_arg_pointer (void)
 }
 
 rtx
-default_static_chain (const_tree ARG_UNUSED (fndecl_or_type), bool incoming_p)
+default_static_chain (const_tree fndecl_or_type, bool incoming_p)
 {
+  /* While this function won't be called by the middle-end when a static
+     chain isn't needed, it's also used throughout the backend so it's
+     easiest to keep this check centralized.  */
+  if (DECL_P (fndecl_or_type) && !DECL_STATIC_CHAIN (fndecl_or_type))
+    return NULL;
+
   if (incoming_p)
     {
 #ifdef STATIC_CHAIN_INCOMING_REGNUM


* [PATCH 18/25] Fix interleaving of Fortran stop messages
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (13 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 17/25] Fix Fortran STOP ams
@ 2018-09-05 11:51 ` ams
       [not found]   ` <994a9ec6-2494-9a83-cc84-bd8a551142c5@moene.org>
  2018-09-05 11:51 ` [PATCH 11/25] Simplify vec_merge according to the mask ams
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 632 bytes --]


Fortran STOP and ERROR STOP print the "STOP" string and the message string via
two different functions.  On GCN this results in out-of-order output, such as
"<msg>ERROR STOP ".

This patch fixes the problem by making estr_write use the proper Fortran write,
not C printf, so both parts are now output the same way.  This also ensures
that both parts are output to STDERR (not that that means anything on GCN).

2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>

	libgfortran/
	* runtime/minimal.c (estr_write): Define in terms of write.
---
 libgfortran/runtime/minimal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[-- Attachment #2: 0018-Fix-interleaving-of-Fortran-stop-messages.patch --]
[-- Type: text/x-patch; name="0018-Fix-interleaving-of-Fortran-stop-messages.patch", Size: 491 bytes --]

diff --git a/libgfortran/runtime/minimal.c b/libgfortran/runtime/minimal.c
index 8940f97..b6d26fd 100644
--- a/libgfortran/runtime/minimal.c
+++ b/libgfortran/runtime/minimal.c
@@ -196,7 +196,7 @@ sys_abort (void)
 #undef st_printf
 #define st_printf printf
 #undef estr_write
-#define estr_write printf
+#define estr_write(X) write (STDERR_FILENO, (X), strlen (X))
 #if __nvptx__
 /* Map "exit" to "abort"; see PR85463 '[nvptx] "exit" in offloaded region
    doesn't terminate process'.  */


* [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (14 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 18/25] Fix interleaving of Fortran stop messages ams
@ 2018-09-05 11:51 ` ams
  2018-09-17  9:08   ` Richard Sandiford
  2018-09-05 11:51 ` [PATCH 15/25] Don't double-count early-clobber matches ams
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 510 bytes --]


This patch was part of the original patch we acquired from Honza and Martin.

It simplifies away operations on vector elements that the mask makes inactive.
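
For example (modes illustrative), only the mask-clear lanes of the second
operand are live, so an inner VEC_MERGE using the same mask can be looked
through:

  (vec_merge:V64SI (reg:V64SI x)
                   (vec_merge:V64SI (reg:V64SI a) (reg:V64SI b) (reg:DI m))
                   (reg:DI m))
    -->
  (vec_merge:V64SI (reg:V64SI x) (reg:V64SI b) (reg:DI m))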

2018-09-05  Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	gcc/
	* simplify-rtx.c (simplify_merge_mask): New function.
	(simplify_ternary_operation): Use it; also see if VEC_MERGEs with the
	same mask are used in op0 or op1.
---
 gcc/simplify-rtx.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)


[-- Attachment #2: 0011-Simplify-vec_merge-according-to-the-mask.patch --]
[-- Type: text/x-patch; name="0011-Simplify-vec_merge-according-to-the-mask.patch", Size: 3602 bytes --]

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 89487f2..6f27bda 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5578,6 +5578,65 @@ simplify_cond_clz_ctz (rtx x, rtx_code cmp_code, rtx true_val, rtx false_val)
   return NULL_RTX;
 }
 
+/* X is an operand number OP of VEC_MERGE operation with MASK.
+   Try to simplify using knowledge that values outside of MASK
+   will not be used.  */
+
+static rtx
+simplify_merge_mask (rtx x, rtx mask, int op)
+{
+  gcc_assert (VECTOR_MODE_P (GET_MODE (x)));
+  poly_uint64 nunits = GET_MODE_NUNITS (GET_MODE (x));
+  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
+    {
+      if (!side_effects_p (XEXP (x, 1 - op)))
+	return XEXP (x, op);
+    }
+  if (side_effects_p (x))
+    return NULL_RTX;
+  if (UNARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      if (top0)
+	return simplify_gen_unary (GET_CODE (x), GET_MODE (x), top0,
+				   GET_MODE (XEXP (x, 0)));
+    }
+  if (BINARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      if (top0 || top1)
+	return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
+				    top0 ? top0 : XEXP (x, 0),
+				    top1 ? top1 : XEXP (x, 1));
+    }
+  if (GET_RTX_CLASS (GET_CODE (x)) == RTX_TERNARY
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 2)))
+      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 2))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      rtx top2 = simplify_merge_mask (XEXP (x, 2), mask, op);
+      if (top0 || top1 || top2)
+	return simplify_gen_ternary (GET_CODE (x), GET_MODE (x),
+				     GET_MODE (XEXP (x, 0)),
+				     top0 ? top0 : XEXP (x, 0),
+				     top1 ? top1 : XEXP (x, 1),
+				     top2 ? top2 : XEXP (x, 2));
+    }
+  return NULL_RTX;
+}
+
 \f
 /* Simplify CODE, an operation with result mode MODE and three operands,
    OP0, OP1, and OP2.  OP0_MODE was the mode of OP0 before it became
@@ -5967,6 +6026,28 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	  && !side_effects_p (op2) && !side_effects_p (op1))
 	return op0;
 
+      if (!side_effects_p (op2))
+	{
+	  rtx top0 = simplify_merge_mask (op0, op2, 0);
+	  rtx top1 = simplify_merge_mask (op1, op2, 1);
+	  if (top0 || top1)
+	    return simplify_gen_ternary (code, mode, mode,
+					 top0 ? top0 : op0,
+					 top1 ? top1 : op1, op2);
+	}
+
+      if (GET_CODE (op0) == VEC_MERGE
+	  && rtx_equal_p (op2, XEXP (op0, 2))
+	  && !side_effects_p (XEXP (op0, 1)) && !side_effects_p (op2))
+	return simplify_gen_ternary (code, mode, mode,
+				     XEXP (op0, 0), op1, op2);
+
+      if (GET_CODE (op1) == VEC_MERGE
+	  && rtx_equal_p (op2, XEXP (op1, 2))
+	  && !side_effects_p (XEXP (op1, 0)) && !side_effects_p (op2))
+	return simplify_gen_ternary (code, mode, mode,
+				     op0, XEXP (op1, 1), op2);
+
       break;
 
     default:


* [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (10 preceding siblings ...)
  2018-09-05 11:50 ` [PATCH 12/25] Make default_static_chain return NULL in non-static functions ams
@ 2018-09-05 11:51 ` ams
  2018-09-17 19:31   ` Richard Sandiford
  2018-09-05 11:51 ` [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores ams
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1225 bytes --]


This feature probably ought to be reworked as a proper target hook, but I would
like to know if this is the correct solution to the problem first.

The problem is that GCN vectors have a fixed number of elements (64) and the
vector size varies with element size.  E.g. V64QI is 64 bytes and V64SI is 256
bytes.

This is a problem because GCC has an assumption that a) vector registers are
fixed size, and b) if there are multiple vector sizes you want to pick one size
and stick with it for the whole function.

This is a problem in various places, but mostly it's not fatal. However,
get_vectype_for_scalar_type caches the vector size for the first type it
encounters and then tries to apply that to all subsequent vectors, which
completely destroys vectorization.  The caching feature appears to be an
attempt to cope with AVX having a different vector size to other x86 vector
options.
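
For example, if the first type vectorized happens to choose V64QI, the
cached 64-byte size would later force SImode elements into V16SI rather
than the V64SI that GCN wants.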

This patch simply disables the cache so that it must ask the backend for the
preferred mode for every type.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* tree-vect-stmts.c (get_vectype_for_scalar_type): Implement
	TARGET_DISABLE_CURRENT_VECTOR_SIZE.
---
 gcc/tree-vect-stmts.c | 3 +++
 1 file changed, 3 insertions(+)


[-- Attachment #2: 0013-Create-TARGET_DISABLE_CURRENT_VECTOR_SIZE.patch --]
[-- Type: text/x-patch; name="0013-Create-TARGET_DISABLE_CURRENT_VECTOR_SIZE.patch", Size: 578 bytes --]

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 607a2bd..8875201 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -9945,9 +9945,12 @@ get_vectype_for_scalar_type (tree scalar_type)
   tree vectype;
   vectype = get_vectype_for_scalar_type_and_size (scalar_type,
 						  current_vector_size);
+/* FIXME: use a proper target hook or macro.  */
+#ifndef TARGET_DISABLE_CURRENT_VECTOR_SIZE
   if (vectype
       && known_eq (current_vector_size, 0U))
     current_vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+#endif
   return vectype;
 }
 


* [PATCH 15/25] Don't double-count early-clobber matches.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (15 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 11/25] Simplify vec_merge according to the mask ams
@ 2018-09-05 11:51 ` ams
  2018-09-17  9:22   ` Richard Sandiford
  2018-09-05 11:51 ` [PATCH 16/25] Fix IRA ICE ams
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1324 bytes --]


Given a pattern with a number of operands:

(match_operand 0 "" "=&v")
(match_operand 1 "" " v0")
(match_operand 2 "" " v0")
(match_operand 3 "" " v0")

GCC will currently increment "reject" once, for operand 0, and then decrement
it once for each of the other operands, ending with reject == -2 and an
assertion failure.  If there's a conflict then it might try to decrement reject
yet again.

Incidentally, what these patterns are trying to achieve is an allocation in
which operand 0 may match one of the other operands, but may not partially
overlap any of them.  Ideally there'd be a better way to do this.

In any case, it will affect any pattern in which multiple operands may (or
must) match an early-clobber operand.

The patch only allows "reject--" for a given operand when one has not already
occurred for that operand.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* lra-constraints.c (process_alt_operands): Check
	matching_early_clobber before decrementing reject, and set
	matching_early_clobber after.
	* lra-int.h (struct lra_operand_data): Add matching_early_clobber.
	* lra.c (setup_operand_alternative): Initialize matching_early_clobber.
---
 gcc/lra-constraints.c | 22 ++++++++++++++--------
 gcc/lra-int.h         |  3 +++
 gcc/lra.c             |  1 +
 3 files changed, 18 insertions(+), 8 deletions(-)


[-- Attachment #2: 0015-Don-t-double-count-early-clobber-matches.patch --]
[-- Type: text/x-patch; name="0015-Don-t-double-count-early-clobber-matches.patch", Size: 3004 bytes --]

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 8be4d46..55163f1 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -2202,7 +2202,13 @@ process_alt_operands (int only_alternative)
 				 "            %d Matching earlyclobber alt:"
 				 " reject--\n",
 				 nop);
-			    reject--;
+			    if (!curr_static_id->operand[m]
+						 .matching_early_clobber)
+			      {
+				reject--;
+				curr_static_id->operand[m]
+						.matching_early_clobber = 1;
+			      }
 			  }
 			/* Otherwise we prefer no matching
 			   alternatives because it gives more freedom
@@ -2948,15 +2954,11 @@ process_alt_operands (int only_alternative)
 	      curr_alt_dont_inherit_ops[curr_alt_dont_inherit_ops_num++]
 		= last_conflict_j;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Conflict early clobber reload: reject--\n",
 		   i);
-	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
 	    }
 	  else
 	    {
@@ -2980,17 +2982,21 @@ process_alt_operands (int only_alternative)
 		}
 	      curr_alt_win[i] = curr_alt_match_win[i] = false;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Matched conflict early clobber reloads: "
 		   "reject--\n",
 		   i);
+	    }
+	  /* Early clobber was already reflected in REJECT. */
+	  if (!curr_static_id->operand[i].matching_early_clobber)
+	    {
+	      lra_assert (reject > 0);
 	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
+	      curr_static_id->operand[i].matching_early_clobber = 1;
 	    }
+	  overall += LRA_LOSER_COST_FACTOR - 1;
 	}
       if (lra_dump_file != NULL)
 	fprintf (lra_dump_file, "          alt=%d,overall=%d,losers=%d,rld_nregs=%d\n",
diff --git a/gcc/lra-int.h b/gcc/lra-int.h
index 5267b53..f193e1f 100644
--- a/gcc/lra-int.h
+++ b/gcc/lra-int.h
@@ -147,6 +147,9 @@ struct lra_operand_data
      This field is set up every time when corresponding
      operand_alternative in lra_static_insn_data is set up.  */
   unsigned int early_clobber : 1;
+  /* True if there is an early clobber that has a matching alternative.
+     This field is used to prevent multiple matches being counted.  */
+  unsigned int matching_early_clobber : 1;
   /* True if the operand is an address.  */
   unsigned int is_address : 1;
 };
diff --git a/gcc/lra.c b/gcc/lra.c
index aa768fb..01dd8b8 100644
--- a/gcc/lra.c
+++ b/gcc/lra.c
@@ -797,6 +797,7 @@ setup_operand_alternative (lra_insn_recog_data_t data,
     {
       static_data->operand[i].early_clobber_alts = 0;
       static_data->operand[i].early_clobber = false;
+      static_data->operand[i].matching_early_clobber = false;
       static_data->operand[i].is_address = false;
       if (static_data->operand[i].constraint[0] == '%')
 	{


* [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (11 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE ams
@ 2018-09-05 11:51 ` ams
  2018-09-17  9:16   ` Richard Sandiford
  2018-09-05 11:51 ` [PATCH 17/25] Fix Fortran STOP ams
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 531 bytes --]


If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
blows away the register file and produces horrible code.

This patch simply disallows elementwise loads and stores for such large
vectors.  Is there a better way to disable this in the middle-end?

2018-09-05  Julian Brown  <julian@codesourcery.com>

	gcc/
	* tree-vect-stmts.c (get_load_store_type): Don't use VMAT_ELEMENTWISE
	loads/stores with many-element (>=64) vectors.
---
 gcc/tree-vect-stmts.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)


[-- Attachment #2: 0014-Disable-inefficient-vectorization-of-elementwise-loa.patch --]
[-- Type: text/x-patch; name="0014-Disable-inefficient-vectorization-of-elementwise-loa.patch", Size: 1152 bytes --]

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 8875201..a333991 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2452,6 +2452,26 @@ get_load_store_type (stmt_vec_info stmt_info, tree vectype, bool slp,
 	*memory_access_type = VMAT_CONTIGUOUS;
     }
 
+  /* FIXME: Element-wise accesses can be extremely expensive if we have a
+     large number of elements to deal with (e.g. 64 for AMD GCN) using the
+     current generic code expansion.  Until an efficient code sequence is
+     supported for affected targets instead, don't attempt vectorization for
+     VMAT_ELEMENTWISE at all.  */
+  if (*memory_access_type == VMAT_ELEMENTWISE)
+    {
+      poly_uint64 nelements = TYPE_VECTOR_SUBPARTS (vectype);
+
+      if (maybe_ge (nelements, 64))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+	      "too many elements (%u) for elementwise accesses\n",
+	      (unsigned) nelements.to_constant ());
+
+	  return false;
+	}
+    }
+
   if ((*memory_access_type == VMAT_ELEMENTWISE
        || *memory_access_type == VMAT_STRIDED_SLP)
       && !nunits.is_constant ())


* [PATCH 16/25] Fix IRA ICE.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (16 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 15/25] Don't double-count early-clobber matches ams
@ 2018-09-05 11:51 ` ams
  2018-09-17  9:36   ` Richard Sandiford
  2018-09-05 11:52 ` [PATCH 22/25] Add dg-require-effective-target exceptions ams
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]


The IRA pass makes an assumption that any pseudos created after the pass begins
were created explicitly by the pass itself and therefore will have
corresponding entries in its other tables.

The GCN back-end, however, often creates additional pseudos in its expand
patterns to represent the necessary EXEC value, and these break IRA's
assumption and cause ICEs.

This patch simply has IRA skip unknown pseudos, and the problem goes away.

Presumably, it's not ideal that these registers have not been processed by IRA,
but it does not appear to do any real harm.
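
As an illustration (a hand-written sketch, not code from the port; the
"gen_mov_with_exec" pattern name is made up), a move expander can create a
fresh pseudo behind IRA's back whenever gen_move_insn is called during the
pass:

  /* Inside a hypothetical movdi expander.  */
  rtx exec = gen_reg_rtx (DImode);      /* New pseudo, unknown to IRA.  */
  emit_move_insn (exec, constm1_rtx);   /* Enable all 64 lanes.  */
  emit_insn (gen_mov_with_exec (operands[0], operands[1], exec));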

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* ira.c (setup_preferred_alternate_classes_for_new_pseudos): Skip
	pseudos not created by this pass.
	(move_unallocated_pseudos): Likewise.
---
 gcc/ira.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0016-Fix-IRA-ICE.patch --]
[-- Type: text/x-patch; name="0016-Fix-IRA-ICE.patch", Size: 1128 bytes --]

diff --git a/gcc/ira.c b/gcc/ira.c
index def194a..e0c293c 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -2769,7 +2769,12 @@ setup_preferred_alternate_classes_for_new_pseudos (int start)
   for (i = start; i < max_regno; i++)
     {
       old_regno = ORIGINAL_REGNO (regno_reg_rtx[i]);
-      ira_assert (i != old_regno);
+
+      /* Skip any new pseudos not created directly by this pass.
+	 gen_move_insn can do this on AMD GCN, for example.  */
+      if (i == old_regno)
+	continue;
+
       setup_reg_classes (i, reg_preferred_class (old_regno),
 			 reg_alternate_class (old_regno),
 			 reg_allocno_class (old_regno));
@@ -5054,6 +5059,12 @@ move_unallocated_pseudos (void)
       {
 	int idx = i - first_moveable_pseudo;
 	rtx other_reg = pseudo_replaced_reg[idx];
+
+	/* Skip any new pseudos not created directly by find_moveable_pseudos.
+	   gen_move_insn can do this on AMD GCN, for example.  */
+	if (!other_reg)
+	  continue;
+
 	rtx_insn *def_insn = DF_REF_INSN (DF_REG_DEF_CHAIN (i));
 	/* The use must follow all definitions of OTHER_REG, so we can
 	   insert the new definition immediately after any of them.  */

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 17/25] Fix Fortran STOP.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (12 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores ams
@ 2018-09-05 11:51 ` ams
       [not found]   ` <c0630914-1252-1391-9bf9-f03434d46f5a@moene.org>
  2018-09-05 11:51 ` [PATCH 18/25] Fix interleaving of Fortran stop messages ams
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 437 bytes --]


The minimal libgfortran setup was created for NVPTX, but it will also be used
by AMD GCN.

This patch simply removes an assumption that NVPTX is the only user.
Specifically, the workaround that maps "exit" to "abort" is only needed on
NVPTX (see PR85463); "exit" works just fine on AMD GCN.
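
The failure mode is easy to see in a simplified sketch of minimal.c (only the
#define is verbatim; stop_numeric is abbreviated).  With the macro applied
unconditionally, every Fortran STOP aborted, even on targets whose exit works:

  #include <stdlib.h>

  #define exit(...) do { abort (); } while (0)

  void
  stop_numeric (int code)
  {
    exit (code);  /* Expands to abort (); STOP loses its exit code.  */
  }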

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	libgfortran/
	* runtime/minimal.c (exit): Only work around nvptx bugs on nvptx.
---
 libgfortran/runtime/minimal.c | 2 ++
 1 file changed, 2 insertions(+)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0017-Fix-Fortran-STOP.patch --]
[-- Type: text/x-patch; name="0017-Fix-Fortran-STOP.patch", Size: 554 bytes --]

diff --git a/libgfortran/runtime/minimal.c b/libgfortran/runtime/minimal.c
index 0b1efeb..8940f97 100644
--- a/libgfortran/runtime/minimal.c
+++ b/libgfortran/runtime/minimal.c
@@ -197,10 +197,12 @@ sys_abort (void)
 #define st_printf printf
 #undef estr_write
 #define estr_write printf
+#if __nvptx__
 /* Map "exit" to "abort"; see PR85463 '[nvptx] "exit" in offloaded region
    doesn't terminate process'.  */
 #undef exit
 #define exit(...) do { abort (); } while (0)
+#endif
 #undef exit_error
 #define exit_error(...) do { abort (); } while (0)
 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 20/25] GCN libgcc.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (19 preceding siblings ...)
  2018-09-05 11:52 ` [PATCH 24/25] Ignore LLVM's blank lines ams
@ 2018-09-05 11:52 ` ams
  2018-09-05 12:32   ` Joseph Myers
  2018-11-09 18:49   ` Jeff Law
  2018-09-05 11:52 ` [PATCH 19/25] GCN libgfortran ams
                   ` (4 subsequent siblings)
  25 siblings, 2 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2482 bytes --]


This patch contains the GCN port of libgcc.  I've broken it out just to keep
both parts more manageable.

We have the usual stuff, plus a "gomp_print" implementation intended to provide
a means to output text to the console without using the full printf.
Originally this was because we did not have a working Newlib port, but now it
provides the underlying mechanism for printf.  It's also much lighter than
printf, and therefore more suitable for debugging offload kernels (for which
there is no debugger, yet).
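
For example, a kernel can emit trace output like this (usage sketch; the
declarations match those in the new gomp_print.c below, where int64_t is
typedef'd to long):

  extern void gomp_print_string (const char *msg, const char *value);
  extern void gomp_print_integer (const char *msg, long value);

  void
  kernel_body (long n)
  {
    gomp_print_string ("stage: ", "reduction");
    gomp_print_integer ("n = ", n);
  }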

In order to work in offload kernels, the same function must be present in both
the host and GCN toolchains.  Therefore it needs to live in libgomp (hence the
name).  However, having found it also useful in stand-alone testing, I have
moved the GCN implementation to libgcc.

It was also necessary to provide a means to disable EMUTLS.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	libgcc/
	* Makefile.in: Don't add emutls.c when --enable-emutls is "no".
	* config.host: Recognize amdgcn*-*-amdhsa.
	* config/gcn/crt0.c: New file.
	* config/gcn/gomp_print.c: New file.
	* config/gcn/lib2-divmod-hi.c: New file.
	* config/gcn/lib2-divmod.c: New file.
	* config/gcn/lib2-gcn.h: New file.
	* config/gcn/reduction.c: New file.
	* config/gcn/sfp-machine.h: New file.
	* config/gcn/t-amdgcn: New file.
---
 libgcc/Makefile.in                 |   2 +
 libgcc/config.host                 |   8 +++
 libgcc/config/gcn/crt0.c           |  23 ++++++++
 libgcc/config/gcn/gomp_print.c     |  99 +++++++++++++++++++++++++++++++
 libgcc/config/gcn/lib2-divmod-hi.c | 117 +++++++++++++++++++++++++++++++++++++
 libgcc/config/gcn/lib2-divmod.c    | 117 +++++++++++++++++++++++++++++++++++++
 libgcc/config/gcn/lib2-gcn.h       |  49 ++++++++++++++++
 libgcc/config/gcn/reduction.c      |  30 ++++++++++
 libgcc/config/gcn/sfp-machine.h    |  51 ++++++++++++++++
 libgcc/config/gcn/t-amdgcn         |  25 ++++++++
 10 files changed, 521 insertions(+)
 create mode 100644 libgcc/config/gcn/crt0.c
 create mode 100644 libgcc/config/gcn/gomp_print.c
 create mode 100644 libgcc/config/gcn/lib2-divmod-hi.c
 create mode 100644 libgcc/config/gcn/lib2-divmod.c
 create mode 100644 libgcc/config/gcn/lib2-gcn.h
 create mode 100644 libgcc/config/gcn/reduction.c
 create mode 100644 libgcc/config/gcn/sfp-machine.h
 create mode 100644 libgcc/config/gcn/t-amdgcn


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0020-GCN-libgcc.patch --]
[-- Type: text/x-patch; name="0020-GCN-libgcc.patch", Size: 16560 bytes --]

diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
index 0c5b264..6f68257 100644
--- a/libgcc/Makefile.in
+++ b/libgcc/Makefile.in
@@ -429,9 +429,11 @@ LIB2ADD += enable-execute-stack.c
 # While emutls.c has nothing to do with EH, it is in LIB2ADDEH*
 # instead of LIB2ADD because that's the way to be sure on some targets
 # (e.g. *-*-darwin*) only one copy of it is linked.
+ifneq ($(enable_emutls),no)
 LIB2ADDEH += $(srcdir)/emutls.c
 LIB2ADDEHSTATIC += $(srcdir)/emutls.c
 LIB2ADDEHSHARED += $(srcdir)/emutls.c
+endif
 
 # Library members defined in libgcc2.c.
 lib2funcs = _muldi3 _negdi2 _lshrdi3 _ashldi3 _ashrdi3 _cmpdi2 _ucmpdi2	   \
diff --git a/libgcc/config.host b/libgcc/config.host
index 029f656..29178da 100644
--- a/libgcc/config.host
+++ b/libgcc/config.host
@@ -91,6 +91,10 @@ alpha*-*-*)
 am33_2.0-*-linux*)
 	cpu_type=mn10300
 	;;
+amdgcn*-*-*)
+	cpu_type=gcn
+	tmake_file="${tmake_file} t-softfp-sfdf t-softfp"
+	;;
 arc*-*-*)
 	cpu_type=arc
 	;;
@@ -384,6 +388,10 @@ alpha*-dec-*vms*)
 	extra_parts="$extra_parts vms-dwarf2.o vms-dwarf2eh.o"
 	md_unwind_header=alpha/vms-unwind.h
 	;;
+amdgcn*-*-amdhsa)
+	tmake_file="$tmake_file gcn/t-amdgcn"
+	extra_parts="crt0.o"
+	;;
 arc*-*-elf*)
 	tmake_file="arc/t-arc"
 	extra_parts="crti.o crtn.o crtend.o crtbegin.o crtendS.o crtbeginS.o"
diff --git a/libgcc/config/gcn/crt0.c b/libgcc/config/gcn/crt0.c
new file mode 100644
index 0000000..f4f367b
--- /dev/null
+++ b/libgcc/config/gcn/crt0.c
@@ -0,0 +1,23 @@
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Provide an entry point symbol to silence a linker warning.  */
+void _start (void) {}
diff --git a/libgcc/config/gcn/gomp_print.c b/libgcc/config/gcn/gomp_print.c
new file mode 100644
index 0000000..41f50c3
--- /dev/null
+++ b/libgcc/config/gcn/gomp_print.c
@@ -0,0 +1,99 @@
+/* Newlib may not have been built yet.  */
+typedef long int64_t;
+typedef long size_t;
+extern char *strncpy (char *dst, const char *src, size_t length);
+extern void exit(int);
+
+void gomp_print_string (const char *msg, const char *value);
+void gomp_print_integer (const char *msg, int64_t value);
+void gomp_print_double (const char *msg, double value);
+
+/* This struct must match the one used by gcn-run and libgomp.
+   It holds all the data output from a kernel (besides mapping data).
+ 
+   The base address pointer can be found at kernargs+16.
+ 
+   The next_output counter must be atomically incremented for each
+   print output.  Only when the print data is fully written can the
+   "written" flag be set.  */
+struct output {
+  int return_value;
+  int next_output;
+  struct printf_data {
+    int written;
+    char msg[128];
+    int type;
+    union {
+      int64_t ivalue;
+      double dvalue;
+      char text[128];
+    };
+  } queue[1000];
+};
+
+static struct printf_data *
+reserve_print_slot (void) {
+  /* The kernargs pointer is in s[8:9].
+     This will break if the enable_sgpr_* flags are ever changed.  */
+  char *kernargs;
+  asm ("s_mov_b64 %0, s[8:9]" : "=Sg"(kernargs));
+
+  /* The output data is at kernargs[2].  */
+  struct output *data = *(struct output **)(kernargs + 16);
+
+  /* We don't have atomic operators in C yet.
+     "glc" means return original value.  */
+  int index = 0;
+  asm ("flat_atomic_add %0, %1, %2 glc\n\t"
+       "s_waitcnt 0"
+       : "=v"(index)
+       : "v"(&data->next_output), "v"(1), "e"(1l));
+
+  if (index >= 1000)
+    exit(1);
+
+  return &(data->queue[index]);
+}
+
+void
+gomp_print_string (const char *msg, const char *value)
+{
+  struct printf_data *output = reserve_print_slot ();
+  output->type = 2; /* String.  */
+
+  strncpy (output->msg, msg, 127);
+  output->msg[127] = '\0';
+  strncpy (output->text, value, 127);
+  output->text[127] = '\0';
+
+  asm ("" ::: "memory");
+  output->written = 1;
+}
+
+void
+gomp_print_integer (const char *msg, int64_t value)
+{
+  struct printf_data *output = reserve_print_slot ();
+  output->type = 0; /* Integer.  */
+
+  strncpy (output->msg, msg, 127);
+  output->msg[127] = '\0';
+  output->ivalue = value;
+
+  asm ("" ::: "memory");
+  output->written = 1;
+}
+
+void
+gomp_print_double (const char *msg, double value)
+{
+  struct printf_data *output = reserve_print_slot ();
+  output->type = 1; /* Double.  */
+
+  strncpy (output->msg, msg, 127);
+  output->msg[127] = '\0';
+  output->dvalue = value;
+
+  asm ("" ::: "memory");
+  output->written = 1;
+}
diff --git a/libgcc/config/gcn/lib2-divmod-hi.c b/libgcc/config/gcn/lib2-divmod-hi.c
new file mode 100644
index 0000000..d57e145
--- /dev/null
+++ b/libgcc/config/gcn/lib2-divmod-hi.c
@@ -0,0 +1,117 @@
+/* Copyright (C) 2012-2017 Free Software Foundation, Inc.
+   Contributed by Altera and Mentor Graphics, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+#include "lib2-gcn.h"
+
+/* 16-bit HI divide and modulo as used in gcn.  */
+
+static UHItype
+udivmodhi4 (UHItype num, UHItype den, word_type modwanted)
+{
+  UHItype bit = 1;
+  UHItype res = 0;
+
+  while (den < num && bit && !(den & (1L<<15)))
+    {
+      den <<=1;
+      bit <<=1;
+    }
+  while (bit)
+    {
+      if (num >= den)
+	{
+	  num -= den;
+	  res |= bit;
+	}
+      bit >>=1;
+      den >>=1;
+    }
+  if (modwanted)
+    return num;
+  return res;
+}
+
+
+HItype
+__divhi3 (HItype a, HItype b)
+{
+  word_type neg = 0;
+  HItype res;
+
+  if (a < 0)
+    {
+      a = -a;
+      neg = !neg;
+    }
+
+  if (b < 0)
+    {
+      b = -b;
+      neg = !neg;
+    }
+
+  res = udivmodhi4 (a, b, 0);
+
+  if (neg)
+    res = -res;
+
+  return res;
+}
+
+
+HItype
+__modhi3 (HItype a, HItype b)
+{
+  word_type neg = 0;
+  HItype res;
+
+  if (a < 0)
+    {
+      a = -a;
+      neg = 1;
+    }
+
+  if (b < 0)
+    b = -b;
+
+  res = udivmodhi4 (a, b, 1);
+
+  if (neg)
+    res = -res;
+
+  return res;
+}
+
+
+UHItype
+__udivhi3 (UHItype a, UHItype b)
+{
+  return udivmodhi4 (a, b, 0);
+}
+
+
+UHItype
+__umodhi3 (UHItype a, UHItype b)
+{
+  return udivmodhi4 (a, b, 1);
+}
+
diff --git a/libgcc/config/gcn/lib2-divmod.c b/libgcc/config/gcn/lib2-divmod.c
new file mode 100644
index 0000000..08e7103
--- /dev/null
+++ b/libgcc/config/gcn/lib2-divmod.c
@@ -0,0 +1,117 @@
+/* Copyright (C) 2012-2017 Free Software Foundation, Inc.
+   Contributed by Altera and Mentor Graphics, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+#include "lib2-gcn.h"
+
+/* 32-bit SI divide and modulo as used in gcn.  */
+
+static USItype
+udivmodsi4 (USItype num, USItype den, word_type modwanted)
+{
+  USItype bit = 1;
+  USItype res = 0;
+
+  while (den < num && bit && !(den & (1L<<31)))
+    {
+      den <<=1;
+      bit <<=1;
+    }
+  while (bit)
+    {
+      if (num >= den)
+	{
+	  num -= den;
+	  res |= bit;
+	}
+      bit >>=1;
+      den >>=1;
+    }
+  if (modwanted)
+    return num;
+  return res;
+}
+
+
+SItype
+__divsi3 (SItype a, SItype b)
+{
+  word_type neg = 0;
+  SItype res;
+
+  if (a < 0)
+    {
+      a = -a;
+      neg = !neg;
+    }
+
+  if (b < 0)
+    {
+      b = -b;
+      neg = !neg;
+    }
+
+  res = udivmodsi4 (a, b, 0);
+
+  if (neg)
+    res = -res;
+
+  return res;
+}
+
+
+SItype
+__modsi3 (SItype a, SItype b)
+{
+  word_type neg = 0;
+  SItype res;
+
+  if (a < 0)
+    {
+      a = -a;
+      neg = 1;
+    }
+
+  if (b < 0)
+    b = -b;
+
+  res = udivmodsi4 (a, b, 1);
+
+  if (neg)
+    res = -res;
+
+  return res;
+}
+
+
+USItype
+__udivsi3 (USItype a, USItype b)
+{
+  return udivmodsi4 (a, b, 0);
+}
+
+
+USItype
+__umodsi3 (USItype a, USItype b)
+{
+  return udivmodsi4 (a, b, 1);
+}
+
diff --git a/libgcc/config/gcn/lib2-gcn.h b/libgcc/config/gcn/lib2-gcn.h
new file mode 100644
index 0000000..aff0bd2
--- /dev/null
+++ b/libgcc/config/gcn/lib2-gcn.h
@@ -0,0 +1,49 @@
+/* Integer arithmetic support for gcn.
+
+   Copyright (C) 2012-2017 Free Software Foundation, Inc.
+   Contributed by Altera and Mentor Graphics, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef LIB2_GCN_H
+#define LIB2_GCN_H
+
+/* Types.  */
+
+typedef char QItype __attribute__ ((mode (QI)));
+typedef unsigned char UQItype __attribute__ ((mode (QI)));
+typedef short HItype __attribute__ ((mode (HI)));
+typedef unsigned short UHItype __attribute__ ((mode (HI)));
+typedef int SItype __attribute__ ((mode (SI)));
+typedef unsigned int USItype __attribute__ ((mode (SI)));
+typedef int word_type __attribute__ ((mode (__word__)));
+
+/* Exported functions.  */
+extern SItype __divsi3 (SItype, SItype);
+extern SItype __modsi3 (SItype, SItype);
+extern USItype __udivsi3 (USItype, USItype);
+extern USItype __umodsi3 (USItype, USItype);
+extern HItype __divhi3 (HItype, HItype);
+extern HItype __modhi3 (HItype, HItype);
+extern UHItype __udivhi3 (UHItype, UHItype);
+extern UHItype __umodhi3 (UHItype, UHItype);
+extern SItype __mulsi3 (SItype, SItype);
+
+#endif /* LIB2_GCN_H */
diff --git a/libgcc/config/gcn/reduction.c b/libgcc/config/gcn/reduction.c
new file mode 100644
index 0000000..fbe9aaa
--- /dev/null
+++ b/libgcc/config/gcn/reduction.c
@@ -0,0 +1,30 @@
+/* Oversized reductions lock variable
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   Contributed by Mentor Graphics.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+/* We use a global lock variable for reductions on objects larger than
+   64 bits.  Until and unless proven that lock contention for
+   different reductions is a problem, a single lock will suffice.  */
+
+unsigned volatile __reduction_lock = 0;
diff --git a/libgcc/config/gcn/sfp-machine.h b/libgcc/config/gcn/sfp-machine.h
new file mode 100644
index 0000000..7874081
--- /dev/null
+++ b/libgcc/config/gcn/sfp-machine.h
@@ -0,0 +1,51 @@
+/* Use 32-bit types here to prevent longlong.h trying to use TImode.
+   Once TImode works we might be better to use 64-bit here.  */
+
+#define _FP_W_TYPE_SIZE		32
+#define _FP_W_TYPE		unsigned int
+#define _FP_WS_TYPE		signed int
+#define _FP_I_TYPE		int
+
+#define _FP_MUL_MEAT_S(R,X,Y)				\
+  _FP_MUL_MEAT_1_wide(_FP_WFRACBITS_S,R,X,Y,umul_ppmm)
+#define _FP_MUL_MEAT_D(R,X,Y)				\
+  _FP_MUL_MEAT_2_wide(_FP_WFRACBITS_D,R,X,Y,umul_ppmm)
+
+#define _FP_DIV_MEAT_S(R,X,Y)	_FP_DIV_MEAT_1_loop(S,R,X,Y)
+#define _FP_DIV_MEAT_D(R,X,Y)	_FP_DIV_MEAT_2_udiv(D,R,X,Y)
+
+#define _FP_NANFRAC_S		((_FP_QNANBIT_S << 1) - 1)
+#define _FP_NANFRAC_D		((_FP_QNANBIT_D << 1) - 1), -1
+#define _FP_NANSIGN_S		0
+#define _FP_NANSIGN_D		0
+
+#define _FP_KEEPNANFRACP 1
+#define _FP_QNANNEGATEDP 0
+
+/* Someone please check this.  */
+#define _FP_CHOOSENAN(fs, wc, R, X, Y, OP)			\
+  do {								\
+    if ((_FP_FRAC_HIGH_RAW_##fs(X) & _FP_QNANBIT_##fs)		\
+	&& !(_FP_FRAC_HIGH_RAW_##fs(Y) & _FP_QNANBIT_##fs))	\
+      {								\
+	R##_s = Y##_s;						\
+	_FP_FRAC_COPY_##wc(R,Y);				\
+      }								\
+    else							\
+      {								\
+	R##_s = X##_s;						\
+	_FP_FRAC_COPY_##wc(R,X);				\
+      }								\
+    R##_c = FP_CLS_NAN;						\
+  } while (0)
+
+#define _FP_TININESS_AFTER_ROUNDING 0
+
+#define __LITTLE_ENDIAN 1234
+#define	__BIG_ENDIAN	4321
+#define __BYTE_ORDER __LITTLE_ENDIAN
+
+/* Define ALIASNAME as a strong alias for NAME.  */
+# define strong_alias(name, aliasname) _strong_alias(name, aliasname)
+# define _strong_alias(name, aliasname) \
+  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
diff --git a/libgcc/config/gcn/t-amdgcn b/libgcc/config/gcn/t-amdgcn
new file mode 100644
index 0000000..d0c423d
--- /dev/null
+++ b/libgcc/config/gcn/t-amdgcn
@@ -0,0 +1,25 @@
+LIB2ADD += $(srcdir)/config/gcn/gomp_print.c
+
+LIB2ADD += $(srcdir)/config/gcn/lib2-divmod.c \
+	   $(srcdir)/config/gcn/lib2-divmod-hi.c
+
+LIB2ADD += $(srcdir)/config/gcn/reduction.c
+
+LIB2ADDEH=
+LIB2FUNCS_EXCLUDE=__main
+
+override LIB2FUNCS_ST := $(filter-out __gcc_bcmp,$(LIB2FUNCS_ST))
+
+# Debug information is not useful, and probably uses broken relocations
+LIBGCC2_DEBUG_CFLAGS = -g0
+
+crt0.o: $(srcdir)/config/gcn/crt0.c
+	$(crt_compile) -c $<
+
+# Prevent building "advanced" stuff (for example, gcov support).  We don't
+# support it, and it may cause the build to fail, because of alloca usage, for
+# example.
+INHIBIT_LIBC_CFLAGS = -Dinhibit_libc
+
+# Disable emutls.c (temporarily?)
+enable_emutls = no

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 23/25] Testsuite: GCN is always PIE.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (21 preceding siblings ...)
  2018-09-05 11:52 ` [PATCH 19/25] GCN libgfortran ams
@ 2018-09-05 11:52 ` ams
  2018-09-14 16:39   ` Jeff Law
  2018-09-05 11:53 ` [PATCH 25/25] Port testsuite to GCN ams
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]


The GCN/HSA loader ignores the load address and loads binaries at a random
location, so we build all GCN binaries as PIE by default.

This patch makes the necessary testsuite adjustments to make this work
correctly.
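
Concretely, since -fPIE is effectively always on, the macros that the
pic-*/pie-* tests probe are always predefined on amdgcn (assuming the usual
-fPIE behaviour of defining both __PIC__ and __PIE__ to 2), so tests built
around their absence must be skipped:

  /* What the skipped tests effectively assert; always true on amdgcn.  */
  #ifndef __PIC__
  # error "__PIC__ should be predefined when PIE is the default"
  #endif
  #if __PIE__ != 2
  # error "__PIE__ should be 2 under the default -fPIE"
  #endif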

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/testsuite/
	* gcc.dg/graphite/scop-19.c: Check pie_enabled.
	* gcc.dg/pic-1.c: Disable on amdgcn.
	* gcc.dg/pic-2.c: Disable on amdgcn.
	* gcc.dg/pic-3.c: Disable on amdgcn.
	* gcc.dg/pic-4.c: Disable on amdgcn.
	* gcc.dg/pie-3.c: Disable on amdgcn.
	* gcc.dg/pie-4.c: Disable on amdgcn.
	* gcc.dg/uninit-19.c: Check pie_enabled.
	* lib/target-supports.exp (check_effective_target_pie): Add amdgcn.
---
 gcc/testsuite/gcc.dg/graphite/scop-19.c | 4 ++--
 gcc/testsuite/gcc.dg/pic-1.c            | 2 +-
 gcc/testsuite/gcc.dg/pic-2.c            | 1 +
 gcc/testsuite/gcc.dg/pic-3.c            | 2 +-
 gcc/testsuite/gcc.dg/pic-4.c            | 2 +-
 gcc/testsuite/gcc.dg/pie-3.c            | 2 +-
 gcc/testsuite/gcc.dg/pie-4.c            | 2 +-
 gcc/testsuite/gcc.dg/uninit-19.c        | 4 ++--
 gcc/testsuite/lib/target-supports.exp   | 3 ++-
 9 files changed, 12 insertions(+), 10 deletions(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0023-Testsuite-GCN-is-always-PIE.patch --]
[-- Type: text/x-patch; name="0023-Testsuite-GCN-is-always-PIE.patch", Size: 4782 bytes --]

diff --git a/gcc/testsuite/gcc.dg/graphite/scop-19.c b/gcc/testsuite/gcc.dg/graphite/scop-19.c
index c89717b..6028132 100644
--- a/gcc/testsuite/gcc.dg/graphite/scop-19.c
+++ b/gcc/testsuite/gcc.dg/graphite/scop-19.c
@@ -31,6 +31,6 @@ d_growable_string_append_buffer (struct d_growable_string *dgs,
   if (need > dgs->alc)
     d_growable_string_resize (dgs, need);
 }
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 0" 2 "graphite" { target nonpic } } } */
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 0" 1 "graphite" { target { ! nonpic } } } } */
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 0" 2 "graphite" { target { nonpic || pie_enabled } } } } */
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 0" 1 "graphite" { target { ! { nonpic || pie_enabled } } } } } */
 
diff --git a/gcc/testsuite/gcc.dg/pic-1.c b/gcc/testsuite/gcc.dg/pic-1.c
index 82ba43d..4bb332e 100644
--- a/gcc/testsuite/gcc.dg/pic-1.c
+++ b/gcc/testsuite/gcc.dg/pic-1.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! { *-*-darwin* hppa*-*-* } } } } */
+/* { dg-do compile { target { ! { *-*-darwin* hppa*-*-* amdgcn*-*-* } } } } */
 /* { dg-require-effective-target fpic } */
 /* { dg-options "-fpic" } */
 
diff --git a/gcc/testsuite/gcc.dg/pic-2.c b/gcc/testsuite/gcc.dg/pic-2.c
index bccec13..3846ec4 100644
--- a/gcc/testsuite/gcc.dg/pic-2.c
+++ b/gcc/testsuite/gcc.dg/pic-2.c
@@ -2,6 +2,7 @@
 /* { dg-require-effective-target fpic } */
 /* { dg-options "-fPIC" } */
 /* { dg-skip-if "__PIC__ is always 1 for MIPS" { mips*-*-* } } */
+/* { dg-skip-if "__PIE__ is always defined for GCN" { amdgcn*-*-* } } */
 
 #if __PIC__ != 2
 # error __PIC__ is not 2!
diff --git a/gcc/testsuite/gcc.dg/pic-3.c b/gcc/testsuite/gcc.dg/pic-3.c
index c56f06f..1397977 100644
--- a/gcc/testsuite/gcc.dg/pic-3.c
+++ b/gcc/testsuite/gcc.dg/pic-3.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* } } } } */
+/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* amdgcn*-*-* } } } } */
 /* { dg-options "-fno-pic" } */
 
 #ifdef __PIC__
diff --git a/gcc/testsuite/gcc.dg/pic-4.c b/gcc/testsuite/gcc.dg/pic-4.c
index 2afdd99..d6d9dc9 100644
--- a/gcc/testsuite/gcc.dg/pic-4.c
+++ b/gcc/testsuite/gcc.dg/pic-4.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* } } } } */
+/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* amdgcn*-*-* } } } } */
 /* { dg-options "-fno-PIC" } */
 
 #ifdef __PIC__
diff --git a/gcc/testsuite/gcc.dg/pie-3.c b/gcc/testsuite/gcc.dg/pie-3.c
index 5577437..fd4a48d 100644
--- a/gcc/testsuite/gcc.dg/pie-3.c
+++ b/gcc/testsuite/gcc.dg/pie-3.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* } } } } */
+/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* amdgcn*-*-* } } } } */
 /* { dg-options "-fno-pie" } */
 
 #ifdef __PIC__
diff --git a/gcc/testsuite/gcc.dg/pie-4.c b/gcc/testsuite/gcc.dg/pie-4.c
index 4134676..5523602 100644
--- a/gcc/testsuite/gcc.dg/pie-4.c
+++ b/gcc/testsuite/gcc.dg/pie-4.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* } } } } */
+/* { dg-do compile { target { ! { *-*-darwin* hppa*64*-*-* mips*-*-linux-* amdgcn*-*-* } } } } */
 /* { dg-options "-fno-PIE" } */
 
 #ifdef __PIC__
diff --git a/gcc/testsuite/gcc.dg/uninit-19.c b/gcc/testsuite/gcc.dg/uninit-19.c
index 094dc0e..3f5f06a 100644
--- a/gcc/testsuite/gcc.dg/uninit-19.c
+++ b/gcc/testsuite/gcc.dg/uninit-19.c
@@ -12,7 +12,7 @@ fn1 (int p1, float *f1, float *f2, float *f3, unsigned char *c1, float *f4,
 {
   if (p1 & 8)
     b[3] = p10[a];
-  /* { dg-warning "may be used uninitialized" "" { target { { nonpic } || { hppa*64*-*-* } } } .-1 } */
+  /* { dg-warning "may be used uninitialized" "" { target { { nonpic || pie_enabled } || { hppa*64*-*-* } } } .-1 } */
 }
 
 void
@@ -22,5 +22,5 @@ fn2 ()
   if (l & 6)
     n = &c + m;
   fn1 (l, &d, &e, &g, &i, &h, &k, n);
-  /* { dg-warning "may be used uninitialized" "" { target { ! { { nonpic } || { hppa*64*-*-* } } } } .-1 } */
+  /* { dg-warning "may be used uninitialized" "" { target { ! { { nonpic || pie_enabled } || { hppa*64*-*-* } } } } .-1 } */
 }
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index e27bed0..61442bd 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -1185,7 +1185,8 @@ proc check_effective_target_pie { } {
 	 || [istarget *-*-dragonfly*]
 	 || [istarget *-*-freebsd*]
 	 || [istarget *-*-linux*]
-	 || [istarget *-*-gnu*] } {
+	 || [istarget *-*-gnu*]
+	 || [istarget *-*-amdhsa] } {
 	return 1;
     }
     if { [istarget *-*-solaris2.1\[1-9\]*] } {

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 19/25] GCN libgfortran.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (20 preceding siblings ...)
  2018-09-05 11:52 ` [PATCH 20/25] GCN libgcc ams
@ 2018-09-05 11:52 ` ams
       [not found]   ` <41281e27-ad85-e50c-8fed-6f4f6f18289c@moene.org>
  2018-09-11 22:47   ` Jeff Law
  2018-09-05 11:52 ` [PATCH 23/25] Testsuite: GCN is always PIE ams
                   ` (3 subsequent siblings)
  25 siblings, 2 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 588 bytes --]


This patch contains the GCN port of libgfortran.  We use the minimal
configuration created for NVPTX.  That's all that's required, besides the
target-independent bug fixes posted already.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	libgfortran/
	* configure.ac: Use minimal mode for amdgcn.
	* configure: Regenerate.
---
 libgfortran/configure    | 7 ++++---
 libgfortran/configure.ac | 3 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0019-GCN-libgfortran.patch --]
[-- Type: text/x-patch; name="0019-GCN-libgfortran.patch", Size: 1600 bytes --]

diff --git a/libgfortran/configure b/libgfortran/configure
index a583b67..fd8b697 100755
--- a/libgfortran/configure
+++ b/libgfortran/configure
@@ -5994,7 +5994,8 @@ fi
 # * C library support for other features such as signal, environment
 #   variables, time functions
 
- if test "x${target_cpu}" = xnvptx; then
+ if test "x${target_cpu}" = xnvptx \
+				 || test "x${target_cpu}" = xamdgcn; then
   LIBGFOR_MINIMAL_TRUE=
   LIBGFOR_MINIMAL_FALSE='#'
 else
@@ -12514,7 +12515,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 12517 "configure"
+#line 12518 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -12620,7 +12621,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 12623 "configure"
+#line 12624 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
diff --git a/libgfortran/configure.ac b/libgfortran/configure.ac
index 05952aa..11b629d 100644
--- a/libgfortran/configure.ac
+++ b/libgfortran/configure.ac
@@ -206,7 +206,8 @@ AM_CONDITIONAL(LIBGFOR_USE_SYMVER_SUN, [test "x$gfortran_use_symver" = xsun])
 # * C library support for other features such as signal, environment
 #   variables, time functions
 
-AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx])
+AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx \
+				 || test "x${target_cpu}" = xamdgcn])
 
 # Figure out whether the compiler supports "-ffunction-sections -fdata-sections",
 # similarly to how libstdc++ does it

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 22/25] Add dg-require-effective-target exceptions
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (17 preceding siblings ...)
  2018-09-05 11:51 ` [PATCH 16/25] Fix IRA ICE ams
@ 2018-09-05 11:52 ` ams
  2018-09-17  9:40   ` Richard Sandiford
  2018-09-17 17:53   ` Mike Stump
  2018-09-05 11:52 ` [PATCH 24/25] Ignore LLVM's blank lines ams
                   ` (6 subsequent siblings)
  25 siblings, 2 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]


There are a number of tests that fail because they assume that exceptions are
available, but GCN does not support them yet.

This patch adds "dg-require-effective-target exceptions" in all the affected
tests.  There's probably an automatic way to test for exceptions, but the
current implementation simply says that AMD GCN does not support them.  This
should ensure that no other targets are affected by the change.
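
The affected tests all follow roughly this shape (illustrative sketch, not a
test from the series): they pass -fexceptions and rely on EH cleanups that the
GCN back end cannot lower yet, so the new directive skips them there:

  /* { dg-do compile } */
  /* { dg-options "-fexceptions" } */
  /* { dg-require-effective-target exceptions } */

  void release (int *p);

  void
  f (void)
  {
    int guarded __attribute__ ((cleanup (release)));  /* Needs EH tables.  */
  }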

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	gcc/testsuite/
	* c-c++-common/ubsan/pr71512-1.c: Require exceptions.
	* c-c++-common/ubsan/pr71512-2.c: Require exceptions.
	* gcc.c-torture/compile/pr34648.c: Require exceptions.
	* gcc.c-torture/compile/pr41469.c: Require exceptions.
	* gcc.dg/20111216-1.c: Require exceptions.
	* gcc.dg/cleanup-10.c: Require exceptions.
	* gcc.dg/cleanup-11.c: Require exceptions.
	* gcc.dg/cleanup-12.c: Require exceptions.
	* gcc.dg/cleanup-13.c: Require exceptions.
	* gcc.dg/cleanup-5.c: Require exceptions.
	* gcc.dg/cleanup-8.c: Require exceptions.
	* gcc.dg/cleanup-9.c: Require exceptions.
	* gcc.dg/gomp/pr29955.c: Require exceptions.
	* gcc.dg/lto/pr52097_0.c: Require exceptions.
	* gcc.dg/nested-func-5.c: Require exceptions.
	* gcc.dg/pch/except-1.c: Require exceptions.
	* gcc.dg/pch/valid-2.c: Require exceptions.
	* gcc.dg/pr41470.c: Require exceptions.
	* gcc.dg/pr42427.c: Require exceptions.
	* gcc.dg/pr44545.c: Require exceptions.
	* gcc.dg/pr47086.c: Require exceptions.
	* gcc.dg/pr51481.c: Require exceptions.
	* gcc.dg/pr51644.c: Require exceptions.
	* gcc.dg/pr52046.c: Require exceptions.
	* gcc.dg/pr54669.c: Require exceptions.
	* gcc.dg/pr56424.c: Require exceptions.
	* gcc.dg/pr64465.c: Require exceptions.
	* gcc.dg/pr65802.c: Require exceptions.
	* gcc.dg/pr67563.c: Require exceptions.
	* gcc.dg/tree-ssa/pr41469-1.c: Require exceptions.
	* gcc.dg/tree-ssa/ssa-dse-28.c: Require exceptions.
	* gcc.dg/vect/pr46663.c: Require exceptions.
	* lib/target-supports.exp (check_effective_target_exceptions): New.
---
 gcc/testsuite/c-c++-common/ubsan/pr71512-1.c  |  1 +
 gcc/testsuite/c-c++-common/ubsan/pr71512-2.c  |  1 +
 gcc/testsuite/gcc.c-torture/compile/pr34648.c |  1 +
 gcc/testsuite/gcc.c-torture/compile/pr41469.c |  1 +
 gcc/testsuite/gcc.dg/20111216-1.c             |  1 +
 gcc/testsuite/gcc.dg/cleanup-10.c             |  1 +
 gcc/testsuite/gcc.dg/cleanup-11.c             |  1 +
 gcc/testsuite/gcc.dg/cleanup-12.c             |  1 +
 gcc/testsuite/gcc.dg/cleanup-13.c             |  1 +
 gcc/testsuite/gcc.dg/cleanup-5.c              |  1 +
 gcc/testsuite/gcc.dg/cleanup-8.c              |  1 +
 gcc/testsuite/gcc.dg/cleanup-9.c              |  1 +
 gcc/testsuite/gcc.dg/gomp/pr29955.c           |  1 +
 gcc/testsuite/gcc.dg/lto/pr52097_0.c          |  1 +
 gcc/testsuite/gcc.dg/nested-func-5.c          |  1 +
 gcc/testsuite/gcc.dg/pch/except-1.c           |  1 +
 gcc/testsuite/gcc.dg/pch/valid-2.c            |  2 +-
 gcc/testsuite/gcc.dg/pr41470.c                |  1 +
 gcc/testsuite/gcc.dg/pr42427.c                |  1 +
 gcc/testsuite/gcc.dg/pr44545.c                |  1 +
 gcc/testsuite/gcc.dg/pr47086.c                |  1 +
 gcc/testsuite/gcc.dg/pr51481.c                |  1 +
 gcc/testsuite/gcc.dg/pr51644.c                |  1 +
 gcc/testsuite/gcc.dg/pr52046.c                |  1 +
 gcc/testsuite/gcc.dg/pr54669.c                |  1 +
 gcc/testsuite/gcc.dg/pr56424.c                |  1 +
 gcc/testsuite/gcc.dg/pr64465.c                |  1 +
 gcc/testsuite/gcc.dg/pr65802.c                |  1 +
 gcc/testsuite/gcc.dg/pr67563.c                |  1 +
 gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c     |  1 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c    |  1 +
 gcc/testsuite/gcc.dg/vect/pr46663.c           |  1 +
 gcc/testsuite/lib/target-supports.exp         | 10 ++++++++++
 33 files changed, 42 insertions(+), 1 deletion(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0022-Add-dg-require-effective-target-exceptions.patch --]
[-- Type: text/x-patch; name="0022-Add-dg-require-effective-target-exceptions.patch", Size: 14522 bytes --]

diff --git a/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c b/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
index 2a90ab1..8af9365 100644
--- a/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
+++ b/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
@@ -1,5 +1,6 @@
 /* PR c/71512 */
 /* { dg-do compile } */
 /* { dg-options "-O2 -fnon-call-exceptions -ftrapv -fexceptions -fsanitize=undefined" } */
+/* { dg-require-effective-target exceptions } */
 
 #include "../../gcc.dg/pr44545.c"
diff --git a/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c b/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
index 1c95593..0c16934 100644
--- a/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
+++ b/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
@@ -1,5 +1,6 @@
 /* PR c/71512 */
 /* { dg-do compile } */
 /* { dg-options "-O -fexceptions -fnon-call-exceptions -ftrapv -fsanitize=undefined" } */
+/* { dg-require-effective-target exceptions } */
 
 #include "../../gcc.dg/pr47086.c"
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr34648.c b/gcc/testsuite/gcc.c-torture/compile/pr34648.c
index 8bcdae0..90a88b9 100644
--- a/gcc/testsuite/gcc.c-torture/compile/pr34648.c
+++ b/gcc/testsuite/gcc.c-torture/compile/pr34648.c
@@ -1,6 +1,7 @@
 /* PR tree-optimization/34648 */
 
 /* { dg-options "-fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern const unsigned short int **bar (void) __attribute__ ((const));
 const char *a;
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr41469.c b/gcc/testsuite/gcc.c-torture/compile/pr41469.c
index 5917794..923bca2 100644
--- a/gcc/testsuite/gcc.c-torture/compile/pr41469.c
+++ b/gcc/testsuite/gcc.c-torture/compile/pr41469.c
@@ -1,5 +1,6 @@
 /* { dg-options "-fexceptions" } */
 /* { dg-skip-if "requires alloca" { ! alloca } { "-O0" } { "" } } */
+/* { dg-require-effective-target exceptions } */
 
 void
 af (void *a)
diff --git a/gcc/testsuite/gcc.dg/20111216-1.c b/gcc/testsuite/gcc.dg/20111216-1.c
index cd82cf9..7f9395e 100644
--- a/gcc/testsuite/gcc.dg/20111216-1.c
+++ b/gcc/testsuite/gcc.dg/20111216-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O -fexceptions -fnon-call-exceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern void f2 () __attribute__ ((noreturn));
 void
diff --git a/gcc/testsuite/gcc.dg/cleanup-10.c b/gcc/testsuite/gcc.dg/cleanup-10.c
index 16035b1..1af63ea 100644
--- a/gcc/testsuite/gcc.dg/cleanup-10.c
+++ b/gcc/testsuite/gcc.dg/cleanup-10.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
 /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
+/* { dg-require-effective-target exceptions } */
 /* Verify that cleanups work with exception handling through signal frames
    on alternate stack.  */
 
diff --git a/gcc/testsuite/gcc.dg/cleanup-11.c b/gcc/testsuite/gcc.dg/cleanup-11.c
index ccc61ed..c1f19fe 100644
--- a/gcc/testsuite/gcc.dg/cleanup-11.c
+++ b/gcc/testsuite/gcc.dg/cleanup-11.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
 /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
+/* { dg-require-effective-target exceptions } */
 /* Verify that cleanups work with exception handling through realtime signal
    frames on alternate stack.  */
 
diff --git a/gcc/testsuite/gcc.dg/cleanup-12.c b/gcc/testsuite/gcc.dg/cleanup-12.c
index efb9a58..2171e35 100644
--- a/gcc/testsuite/gcc.dg/cleanup-12.c
+++ b/gcc/testsuite/gcc.dg/cleanup-12.c
@@ -4,6 +4,7 @@
 /* { dg-options "-O2 -fexceptions" } */
 /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
 /* { dg-skip-if "" { ! nonlocal_goto } } */
+/* { dg-require-effective-target exceptions } */
 /* Verify unwind info in presence of alloca.  */
 
 #include <unwind.h>
diff --git a/gcc/testsuite/gcc.dg/cleanup-13.c b/gcc/testsuite/gcc.dg/cleanup-13.c
index 8a8db27..1b7ea5c 100644
--- a/gcc/testsuite/gcc.dg/cleanup-13.c
+++ b/gcc/testsuite/gcc.dg/cleanup-13.c
@@ -3,6 +3,7 @@
 /* { dg-options "-fexceptions" } */
 /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
 /* { dg-skip-if "" { ! nonlocal_goto } } */
+/* { dg-require-effective-target exceptions } */
 /* Verify DW_OP_* handling in the unwinder.  */
 
 #include <unwind.h>
diff --git a/gcc/testsuite/gcc.dg/cleanup-5.c b/gcc/testsuite/gcc.dg/cleanup-5.c
index 4257f9e..9ed2a7c 100644
--- a/gcc/testsuite/gcc.dg/cleanup-5.c
+++ b/gcc/testsuite/gcc.dg/cleanup-5.c
@@ -3,6 +3,7 @@
 /* { dg-options "-fexceptions" } */
 /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
 /* { dg-skip-if "" { ! nonlocal_goto } } */
+/* { dg-require-effective-target exceptions } */
 /* Verify that cleanups work with exception handling.  */
 
 #include <unwind.h>
diff --git a/gcc/testsuite/gcc.dg/cleanup-8.c b/gcc/testsuite/gcc.dg/cleanup-8.c
index 553c038..45abdb2 100644
--- a/gcc/testsuite/gcc.dg/cleanup-8.c
+++ b/gcc/testsuite/gcc.dg/cleanup-8.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
 /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
+/* { dg-require-effective-target exceptions } */
 /* Verify that cleanups work with exception handling through signal
    frames.  */
 
diff --git a/gcc/testsuite/gcc.dg/cleanup-9.c b/gcc/testsuite/gcc.dg/cleanup-9.c
index fe28072..98dc268 100644
--- a/gcc/testsuite/gcc.dg/cleanup-9.c
+++ b/gcc/testsuite/gcc.dg/cleanup-9.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
 /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
+/* { dg-require-effective-target exceptions } */
 /* Verify that cleanups work with exception handling through realtime
    signal frames.  */
 
diff --git a/gcc/testsuite/gcc.dg/gomp/pr29955.c b/gcc/testsuite/gcc.dg/gomp/pr29955.c
index e49c11c..102898c 100644
--- a/gcc/testsuite/gcc.dg/gomp/pr29955.c
+++ b/gcc/testsuite/gcc.dg/gomp/pr29955.c
@@ -1,6 +1,7 @@
 /* PR c/29955 */
 /* { dg-do compile } */
 /* { dg-options "-O2 -fopenmp -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern void bar (int);
 
diff --git a/gcc/testsuite/gcc.dg/lto/pr52097_0.c b/gcc/testsuite/gcc.dg/lto/pr52097_0.c
index cd4af5d..1b3fda3 100644
--- a/gcc/testsuite/gcc.dg/lto/pr52097_0.c
+++ b/gcc/testsuite/gcc.dg/lto/pr52097_0.c
@@ -1,5 +1,6 @@
 /* { dg-lto-do link } */
 /* { dg-lto-options { { -O -flto -fexceptions -fnon-call-exceptions --param allow-store-data-races=0 } } } */
+/* { dg-require-effective-target exceptions } */
 
 typedef struct { unsigned int e0 : 16; } s1;
 typedef struct { unsigned int e0 : 16; } s2;
diff --git a/gcc/testsuite/gcc.dg/nested-func-5.c b/gcc/testsuite/gcc.dg/nested-func-5.c
index 3545f37..591f8a2 100644
--- a/gcc/testsuite/gcc.dg/nested-func-5.c
+++ b/gcc/testsuite/gcc.dg/nested-func-5.c
@@ -2,6 +2,7 @@
 /* { dg-options "-fexceptions" } */
 /* PR28516: ICE generating ARM unwind directives for nested functions.  */
 /* { dg-require-effective-target trampolines } */
+/* { dg-require-effective-target exceptions } */
 
 void ex(int (*)(void));
 void foo(int i)
diff --git a/gcc/testsuite/gcc.dg/pch/except-1.c b/gcc/testsuite/gcc.dg/pch/except-1.c
index f81b098..30350ed 100644
--- a/gcc/testsuite/gcc.dg/pch/except-1.c
+++ b/gcc/testsuite/gcc.dg/pch/except-1.c
@@ -1,4 +1,5 @@
 /* { dg-options "-fexceptions -I." } */
+/* { dg-require-effective-target exceptions } */
 #include "except-1.h"
 
 int main(void) 
diff --git a/gcc/testsuite/gcc.dg/pch/valid-2.c b/gcc/testsuite/gcc.dg/pch/valid-2.c
index 3d8cb14..15a57c9 100644
--- a/gcc/testsuite/gcc.dg/pch/valid-2.c
+++ b/gcc/testsuite/gcc.dg/pch/valid-2.c
@@ -1,5 +1,5 @@
 /* { dg-options "-I. -Winvalid-pch -fexceptions" } */
-
+/* { dg-require-effective-target exceptions } */
 #include "valid-2.h" /* { dg-warning "settings for -fexceptions do not match" } */
 /* { dg-error "No such file" "no such file" { target *-*-* } 0 } */
 /* { dg-error "they were invalid" "invalid files" { target *-*-* } 0 } */
diff --git a/gcc/testsuite/gcc.dg/pr41470.c b/gcc/testsuite/gcc.dg/pr41470.c
index 7ef0086..7374fac 100644
--- a/gcc/testsuite/gcc.dg/pr41470.c
+++ b/gcc/testsuite/gcc.dg/pr41470.c
@@ -1,6 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-fexceptions" } */
 /* { dg-require-effective-target alloca } */
+/* { dg-require-effective-target exceptions } */
 
 void cf (void *);
 
diff --git a/gcc/testsuite/gcc.dg/pr42427.c b/gcc/testsuite/gcc.dg/pr42427.c
index cb43dd2..cb290fe 100644
--- a/gcc/testsuite/gcc.dg/pr42427.c
+++ b/gcc/testsuite/gcc.dg/pr42427.c
@@ -2,6 +2,7 @@
 /* { dg-options "-O2 -fexceptions -fnon-call-exceptions -fpeel-loops" } */
 /* { dg-add-options c99_runtime } */
 /* { dg-require-effective-target ilp32 } */
+/* { dg-require-effective-target exceptions } */
 
 #include <complex.h>
 
diff --git a/gcc/testsuite/gcc.dg/pr44545.c b/gcc/testsuite/gcc.dg/pr44545.c
index 8058261..37f75f1 100644
--- a/gcc/testsuite/gcc.dg/pr44545.c
+++ b/gcc/testsuite/gcc.dg/pr44545.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fnon-call-exceptions -ftrapv -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 void
 DrawChunk(int *tabSize, int x) 
 {
diff --git a/gcc/testsuite/gcc.dg/pr47086.c b/gcc/testsuite/gcc.dg/pr47086.c
index 71743fe..473e802 100644
--- a/gcc/testsuite/gcc.dg/pr47086.c
+++ b/gcc/testsuite/gcc.dg/pr47086.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O -fexceptions -fnon-call-exceptions -ftrapv" } */
+/* { dg-require-effective-target exceptions } */
 
 void
 foo ()
diff --git a/gcc/testsuite/gcc.dg/pr51481.c b/gcc/testsuite/gcc.dg/pr51481.c
index d883d47..a35f8f3 100644
--- a/gcc/testsuite/gcc.dg/pr51481.c
+++ b/gcc/testsuite/gcc.dg/pr51481.c
@@ -1,6 +1,7 @@
 /* PR tree-optimization/51481 */
 /* { dg-do compile } */
 /* { dg-options "-O -fexceptions -fipa-cp -fipa-cp-clone" } */
+/* { dg-require-effective-target exceptions } */
 
 extern const unsigned short int **foo (void)
   __attribute__ ((__nothrow__, __const__));
diff --git a/gcc/testsuite/gcc.dg/pr51644.c b/gcc/testsuite/gcc.dg/pr51644.c
index 2038a0c..e23c02f 100644
--- a/gcc/testsuite/gcc.dg/pr51644.c
+++ b/gcc/testsuite/gcc.dg/pr51644.c
@@ -1,6 +1,7 @@
 /* PR middle-end/51644 */
 /* { dg-do compile } */
 /* { dg-options "-Wall -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 #include <stdarg.h>
 
diff --git a/gcc/testsuite/gcc.dg/pr52046.c b/gcc/testsuite/gcc.dg/pr52046.c
index e72061f..f0873e2 100644
--- a/gcc/testsuite/gcc.dg/pr52046.c
+++ b/gcc/testsuite/gcc.dg/pr52046.c
@@ -1,6 +1,7 @@
 /* PR tree-optimization/52046 */
 /* { dg-do compile } */
 /* { dg-options "-O3 -fexceptions -fnon-call-exceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern float a[], b[], c[], d[];
 extern int k[];
diff --git a/gcc/testsuite/gcc.dg/pr54669.c b/gcc/testsuite/gcc.dg/pr54669.c
index b68c047..48967ed 100644
--- a/gcc/testsuite/gcc.dg/pr54669.c
+++ b/gcc/testsuite/gcc.dg/pr54669.c
@@ -3,6 +3,7 @@
 
 /* { dg-do compile } */
 /* { dg-options "-O2 -fexceptions -fnon-call-exceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 int a[10];
 
diff --git a/gcc/testsuite/gcc.dg/pr56424.c b/gcc/testsuite/gcc.dg/pr56424.c
index a724c64..7f28f04 100644
--- a/gcc/testsuite/gcc.dg/pr56424.c
+++ b/gcc/testsuite/gcc.dg/pr56424.c
@@ -2,6 +2,7 @@
 
 /* { dg-do compile } */
 /* { dg-options "-O2 -fexceptions -fnon-call-exceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern long double cosl (long double);
 extern long double sinl (long double);
diff --git a/gcc/testsuite/gcc.dg/pr64465.c b/gcc/testsuite/gcc.dg/pr64465.c
index acfa952..d1d1749 100644
--- a/gcc/testsuite/gcc.dg/pr64465.c
+++ b/gcc/testsuite/gcc.dg/pr64465.c
@@ -1,6 +1,7 @@
 /* PR tree-optimization/64465 */
 /* { dg-do compile } */
 /* { dg-options "-O2 -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 extern int foo (int *);
 extern int bar (int, int);
diff --git a/gcc/testsuite/gcc.dg/pr65802.c b/gcc/testsuite/gcc.dg/pr65802.c
index fcec234..0721ca8 100644
--- a/gcc/testsuite/gcc.dg/pr65802.c
+++ b/gcc/testsuite/gcc.dg/pr65802.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O0 -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 #include <stdarg.h>
 
diff --git a/gcc/testsuite/gcc.dg/pr67563.c b/gcc/testsuite/gcc.dg/pr67563.c
index 34a78a2..5a727b8 100644
--- a/gcc/testsuite/gcc.dg/pr67563.c
+++ b/gcc/testsuite/gcc.dg/pr67563.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 static void
 emit_package (int p1)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
index 6be7cd9..eb8e1f2 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fexceptions -fdump-tree-optimized" } */
+/* { dg-require-effective-target exceptions } */
 
 void af (void *a);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
index d35377b..d3a1bbc 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-dse-details -fexceptions -fnon-call-exceptions -fno-isolate-erroneous-paths-dereference" } */
+/* { dg-require-effective-target exceptions } */
 
 
 int foo (int *p, int b)
diff --git a/gcc/testsuite/gcc.dg/vect/pr46663.c b/gcc/testsuite/gcc.dg/vect/pr46663.c
index 457ceae..c2e56bb 100644
--- a/gcc/testsuite/gcc.dg/vect/pr46663.c
+++ b/gcc/testsuite/gcc.dg/vect/pr46663.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-additional-options "-O -fexceptions" } */
+/* { dg-require-effective-target exceptions } */
 
 typedef __attribute__ ((const)) int (*bart) (void);
 
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index b51e8f0..e27bed0 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -8826,6 +8826,16 @@ proc check_effective_target_fenv_exceptions {} {
     } [add_options_for_ieee "-std=gnu99"]]
 }
 
+# Return 1 if -fexceptions is supported.
+
+proc check_effective_target_exceptions {} {
+    if { [istarget amdgcn*-*-*] } {
+	return 0
+    }
+    return 1
+}
+
+
 proc check_effective_target_tiny {} {
     global et_target_tiny_saved
 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 24/25] Ignore LLVM's blank lines.
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (18 preceding siblings ...)
  2018-09-05 11:52 ` [PATCH 22/25] Add dg-require-effective-target exceptions ams
@ 2018-09-05 11:52 ` ams
  2018-09-14 16:19   ` Jeff Law
  2018-09-05 11:52 ` [PATCH 20/25] GCN libgcc ams
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 187+ messages in thread
From: ams @ 2018-09-05 11:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 886 bytes --]


The GCN toolchain must use the LLVM assembler and linker because there's no
binutils port.  The LLVM tools do not have the same diagnostic style as
binutils, so the "blank line(s) in output" tests are inappropriate (and very
noisy).

The LLVM tools also have different command-line options, so it's not possible
to autodetect object formats in the same way.

This patch addresses both issues.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/testsuite/
	* lib/file-format.exp (gcc_target_object_format): Handle AMD GCN.
	* lib/gcc-dg.exp (gcc-dg-prune): Ignore blank lines from the LLVM
	linker.
	* lib/target-supports.exp (check_effective_target_llvm_binutils): New.
---
 gcc/testsuite/lib/file-format.exp     |  3 +++
 gcc/testsuite/lib/gcc-dg.exp          |  2 +-
 gcc/testsuite/lib/target-supports.exp | 14 ++++++++++++++
 3 files changed, 18 insertions(+), 1 deletion(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0024-Ignore-LLVM-s-blank-lines.patch --]
[-- Type: text/x-patch; name="0024-Ignore-LLVM-s-blank-lines.patch", Size: 2249 bytes --]

diff --git a/gcc/testsuite/lib/file-format.exp b/gcc/testsuite/lib/file-format.exp
index 5c47246..c595fe2 100644
--- a/gcc/testsuite/lib/file-format.exp
+++ b/gcc/testsuite/lib/file-format.exp
@@ -41,6 +41,9 @@ proc gcc_target_object_format { } {
     } elseif { [istarget *-*-aix*] } {
 	# AIX doesn't necessarily have objdump, so hand-code it.
 	set gcc_target_object_format_saved coff
+    } elseif { [istarget *-*-amdhsa*] } {
+	# AMD GCN uses LLVM objdump, which is not CLI-compatible.
+	set gcc_target_object_format_saved elf
     } else {
         set objdump_name [find_binutils_prog objdump]
         set open_file [open objfmtst.c w]
diff --git a/gcc/testsuite/lib/gcc-dg.exp b/gcc/testsuite/lib/gcc-dg.exp
index f5e6bef..7df348e 100644
--- a/gcc/testsuite/lib/gcc-dg.exp
+++ b/gcc/testsuite/lib/gcc-dg.exp
@@ -361,7 +361,7 @@ proc gcc-dg-prune { system text } {
 
     # Complain about blank lines in the output (PR other/69006)
     global allow_blank_lines
-    if { !$allow_blank_lines } {
+    if { !$allow_blank_lines && ![check_effective_target_llvm_binutils]} {
 	set num_blank_lines [llength [regexp -all -inline "\n\n" $text]]
 	if { $num_blank_lines } {
 	    global testname_with_flags
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 61442bd..1e627fa 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9129,6 +9129,14 @@ proc check_effective_target_offload_hsa { } {
     } "-foffload=hsa" ]
 }
 
+# Return 1 if the compiler has been configured with amdgcn offloading.
+
+proc check_effective_target_offload_gcn { } {
+    return [check_no_compiler_messages offload_gcn assembly {
+	int main () {return 0;}
+    } "-foffload=amdgcn-unknown-amdhsa" ]
+}
+
 # Return 1 if the target support -fprofile-update=atomic
 proc check_effective_target_profile_update_atomic {} {
     return [check_no_compiler_messages profile_update_atomic assembly {
@@ -9427,3 +9435,9 @@ proc check_effective_target_cet { } {
 	}
     } "-O2" ]
 }
+
+# Return 1 if this target uses an LLVM assembler and/or linker.
+proc check_effective_target_llvm_binutils { } {
+    return [expr { [istarget amdgcn*-*-*]
+		   || [check_effective_target_offload_gcn] } ]
+}

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 25/25] Port testsuite to GCN
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (22 preceding siblings ...)
  2018-09-05 11:52 ` [PATCH 23/25] Testsuite: GCN is always PIE ams
@ 2018-09-05 11:53 ` ams
  2018-09-05 13:40 ` [PATCH 21/25] GCN Back-end (part 1/2) Andrew Stubbs
  2018-09-05 13:43 ` [PATCH 21/25] GCN Back-end (part 2/2) Andrew Stubbs
  25 siblings, 0 replies; 187+ messages in thread
From: ams @ 2018-09-05 11:53 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4623 bytes --]


This collection of miscellaneous patches configures the testsuite to run on AMD
GCN in a standalone (i.e. not offloading) configuration.  It assumes you have
DejaGnu set up to run binaries via the gcn-run tool.
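
For anyone trying to reproduce that setup, the DejaGnu side can be as
small as a board file that wraps execution in gcn-run.  A minimal
sketch, assuming a simulator-style board (the board name and layout are
illustrative, not part of this series):

  # amdgcn-run.exp -- hypothetical DejaGnu board file.
  load_generic_config "sim"
  set_board_info sim "gcn-run"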

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	gcc/testsuite/
	* gcc.dg/20020312-2.c: Add amdgcn support.
	* gcc.dg/Wno-frame-address.c: Disable on amdgcn.
	* gcc.dg/builtin-apply2.c: Likewise.
	* gcc.dg/torture/stackalign/builtin-apply-2.c: Likewise.
	* gcc.dg/gimplefe-28.c: Force -ffast-math.
	* gcc.dg/intermod-1.c: Add -mlocal-symbol-id on amdgcn.
	* gcc.dg/memcmp-1.c: Increase timeout factor.
	* gcc.dg/pr59605-2.c: Add -DMAX_COPY=1025 on amdgcn.
	* gcc.dg/sibcall-10.c: xfail on amdgcn.
	* gcc.dg/sibcall-9.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-11c.c: Likewise.
	* gcc.dg/tree-ssa/pr84512.c: Likewise.
	* gcc.dg/tree-ssa/loop-1.c: Adjust expectations for amdgcn.
	* gfortran.dg/bind_c_array_params_2.f90: Likewise.
	* gcc.dg/vect/tree-vect.h: Avoid signal on amdgcn.
	* lib/target-supports.exp (check_effective_target_trampolines):
	Configure amdgcn.
	(check_profiling_available): Likewise.
	(check_effective_target_global_constructor): Likewise.
	(check_effective_target_return_address): Likewise.
	(check_effective_target_fopenacc): Likewise.
	(check_effective_target_fopenmp): Likewise.
	(check_effective_target_vect_int): Likewise.
	(check_effective_target_vect_intfloat_cvt): Likewise.
	(check_effective_target_vect_uintfloat_cvt): Likewise.
	(check_effective_target_vect_floatint_cvt): Likewise.
	(check_effective_target_vect_floatuint_cvt): Likewise.
	(check_effective_target_vect_simd_clones): Likewise.
	(check_effective_target_vect_shift): Likewise.
	(check_effective_target_whole_vector_shift): Likewise.
	(check_effective_target_vect_bswap): Likewise.
	(check_effective_target_vect_shift_char): Likewise.
	(check_effective_target_vect_long): Likewise.
	(check_effective_target_vect_float): Likewise.
	(check_effective_target_vect_double): Likewise.
	(check_effective_target_vect_perm): Likewise.
	(check_effective_target_vect_perm_byte): Likewise.
	(check_effective_target_vect_perm_short): Likewise.
	(check_effective_target_vect_widen_mult_qi_to_hi): Likewise.
	(check_effective_target_vect_widen_mult_hi_to_si): Likewise.
	(check_effective_target_vect_widen_mult_qi_to_hi_pattern): Likewise.
	(check_effective_target_vect_widen_mult_hi_to_si_pattern): Likewise.
	(check_effective_target_vect_natural_alignment): Likewise.
	(check_effective_target_vect_fully_masked): Likewise.
	(check_effective_target_vect_element_align): Likewise.
	(check_effective_target_vect_masked_store): Likewise.
	(check_effective_target_vect_scatter_store): Likewise.
	(check_effective_target_vect_condition): Likewise.
	(check_effective_target_vect_cond_mixed): Likewise.
	(check_effective_target_vect_char_mult): Likewise.
	(check_effective_target_vect_short_mult): Likewise.
	(check_effective_target_vect_int_mult): Likewise.
	(check_effective_target_sqrt_insn): Likewise.
	(check_effective_target_vect_call_sqrtf): Likewise.
	(check_effective_target_vect_call_btrunc): Likewise.
	(check_effective_target_vect_call_btruncf): Likewise.
	(check_effective_target_vect_call_ceil): Likewise.
	(check_effective_target_vect_call_floorf): Likewise.
	(check_effective_target_lto): Likewise.
	(check_vect_support_and_set_flags): Likewise.
	(check_effective_target_vect_stridedN): Enable when fully masked is
	available.
---
 gcc/testsuite/gcc.dg/20020312-2.c                  |   2 +
 gcc/testsuite/gcc.dg/Wno-frame-address.c           |   2 +-
 gcc/testsuite/gcc.dg/builtin-apply2.c              |   2 +-
 gcc/testsuite/gcc.dg/gimplefe-28.c                 |   2 +-
 gcc/testsuite/gcc.dg/intermod-1.c                  |   1 +
 gcc/testsuite/gcc.dg/memcmp-1.c                    |   1 +
 gcc/testsuite/gcc.dg/pr59605-2.c                   |   2 +-
 gcc/testsuite/gcc.dg/sibcall-10.c                  |   2 +-
 gcc/testsuite/gcc.dg/sibcall-9.c                   |   2 +-
 .../gcc.dg/torture/stackalign/builtin-apply-2.c    |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/gen-vect-11c.c       |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/loop-1.c             |   6 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr84512.c            |   2 +-
 gcc/testsuite/gcc.dg/vect/tree-vect.h              |   4 +
 .../gfortran.dg/bind_c_array_params_2.f90          |   3 +-
 gcc/testsuite/lib/target-supports.exp              | 126 +++++++++++++++------
 16 files changed, 113 insertions(+), 48 deletions(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0025-Port-testsuite-to-GCN.patch --]
[-- Type: text/x-patch; name="0025-Port-testsuite-to-GCN.patch", Size: 26694 bytes --]

diff --git a/gcc/testsuite/gcc.dg/20020312-2.c b/gcc/testsuite/gcc.dg/20020312-2.c
index f8be3ce..c88fdf3 100644
--- a/gcc/testsuite/gcc.dg/20020312-2.c
+++ b/gcc/testsuite/gcc.dg/20020312-2.c
@@ -116,6 +116,8 @@ extern void abort (void);
 # if defined (__CK807__) || defined (__CK810__)
 #   define PIC_REG  "r28"
 # endif
+#elif defined (__AMDGCN__)
+/* No pic register.  */
 #else
 # error "Modify the test for your target."
 #endif
diff --git a/gcc/testsuite/gcc.dg/Wno-frame-address.c b/gcc/testsuite/gcc.dg/Wno-frame-address.c
index 9fe4d07..5e3ef7a 100644
--- a/gcc/testsuite/gcc.dg/Wno-frame-address.c
+++ b/gcc/testsuite/gcc.dg/Wno-frame-address.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-skip-if "Cannot access arbitrary stack frames" { arm*-*-* avr-*-* hppa*-*-* ia64-*-* visium-*-* csky-*-* } } */
+/* { dg-skip-if "Cannot access arbitrary stack frames" { arm*-*-* amdgpu-*-* avr-*-* hppa*-*-* ia64-*-* visium-*-* csky-*-* } } */
 /* { dg-options "-Werror" } */
 /* { dg-additional-options "-mbackchain" { target { s390*-*-* } } } */
 
diff --git a/gcc/testsuite/gcc.dg/builtin-apply2.c b/gcc/testsuite/gcc.dg/builtin-apply2.c
index 3768caa..aca3b1f 100644
--- a/gcc/testsuite/gcc.dg/builtin-apply2.c
+++ b/gcc/testsuite/gcc.dg/builtin-apply2.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-require-effective-target untyped_assembly } */
-/* { dg-skip-if "Variadic funcs have all args on stack. Normal funcs have args in registers." { "avr-*-* nds32*-*-*" } } */
+/* { dg-skip-if "Variadic funcs have all args on stack. Normal funcs have args in registers." { "avr-*-* nds32*-*-* amdgcn-*-*" } } */
 /* { dg-skip-if "Variadic funcs use different argument passing from normal funcs." { "riscv*-*-*" } } */
 /* { dg-skip-if "Variadic funcs use Base AAPCS.  Normal funcs use VFP variant." { arm*-*-* && arm_hf_eabi } } */
 
diff --git a/gcc/testsuite/gcc.dg/gimplefe-28.c b/gcc/testsuite/gcc.dg/gimplefe-28.c
index 467172d..57b6e1f 100644
--- a/gcc/testsuite/gcc.dg/gimplefe-28.c
+++ b/gcc/testsuite/gcc.dg/gimplefe-28.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target sqrt_insn } } */
-/* { dg-options "-fgimple -O2" } */
+/* { dg-options "-fgimple -O2 -ffast-math" } */
 
 double __GIMPLE
 f1 (double x)
diff --git a/gcc/testsuite/gcc.dg/intermod-1.c b/gcc/testsuite/gcc.dg/intermod-1.c
index 9f8d19d..44a8ce0 100644
--- a/gcc/testsuite/gcc.dg/intermod-1.c
+++ b/gcc/testsuite/gcc.dg/intermod-1.c
@@ -1,4 +1,5 @@
 /* { dg-do compile } */
+/* { dg-additional-options "-mlocal-symbol-id=" { target amdgcn-*-* } } */
 /* { dg-final { scan-assembler-not {foo[1-9]\.[0-9]} } } */
 
 /* Check that we don't get .0 suffixes on static variables when not using
diff --git a/gcc/testsuite/gcc.dg/memcmp-1.c b/gcc/testsuite/gcc.dg/memcmp-1.c
index 619cf9b..ea837ca 100644
--- a/gcc/testsuite/gcc.dg/memcmp-1.c
+++ b/gcc/testsuite/gcc.dg/memcmp-1.c
@@ -2,6 +2,7 @@
 /* { dg-do run } */
 /* { dg-options "-O2" } */
 /* { dg-require-effective-target ptr32plus } */
+/* { dg-timeout-factor 2 } */
 
 #include <stdio.h>
 #include <stdlib.h>
diff --git a/gcc/testsuite/gcc.dg/pr59605-2.c b/gcc/testsuite/gcc.dg/pr59605-2.c
index 6d6ff23..9575481 100644
--- a/gcc/testsuite/gcc.dg/pr59605-2.c
+++ b/gcc/testsuite/gcc.dg/pr59605-2.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O2" } */
-/* { dg-additional-options "-DMAX_COPY=1025" { target { { simulator } || { nvptx-*-* } } } } */
+/* { dg-additional-options "-DMAX_COPY=1025" { target { { simulator } || { nvptx-*-* amdgcn*-*-* } } } } */
 /* { dg-additional-options "-minline-stringops-dynamically" { target { i?86-*-* x86_64-*-* } } } */
 
 #include "pr59605.c"
diff --git a/gcc/testsuite/gcc.dg/sibcall-10.c b/gcc/testsuite/gcc.dg/sibcall-10.c
index 54cc604..f3e0a9b 100644
--- a/gcc/testsuite/gcc.dg/sibcall-10.c
+++ b/gcc/testsuite/gcc.dg/sibcall-10.c
@@ -5,7 +5,7 @@
    Copyright (C) 2002 Free Software Foundation Inc.
    Contributed by Hans-Peter Nilsson  <hp@bitrange.com>  */
 
-/* { dg-do run { xfail { { cris-*-* crisv32-*-* csky-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
+/* { dg-do run { xfail { { amdgcn*-*-* cris-*-* crisv32-*-* csky-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
 /* -mlongcall disables sibcall patterns.  */
 /* { dg-skip-if "" { powerpc*-*-* } { "-mlongcall" } { "" } } */
 /* -msave-restore disables sibcall patterns.  */
diff --git a/gcc/testsuite/gcc.dg/sibcall-9.c b/gcc/testsuite/gcc.dg/sibcall-9.c
index fc3bd9d..adb2ca3 100644
--- a/gcc/testsuite/gcc.dg/sibcall-9.c
+++ b/gcc/testsuite/gcc.dg/sibcall-9.c
@@ -5,7 +5,7 @@
    Copyright (C) 2002 Free Software Foundation Inc.
    Contributed by Hans-Peter Nilsson  <hp@bitrange.com>  */
 
-/* { dg-do run { xfail { { cris-*-* crisv32-*-* csky-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* nvptx-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
+/* { dg-do run { xfail { { amdgcn*-*-* cris-*-* crisv32-*-* csky-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* nvptx-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
 /* -mlongcall disables sibcall patterns.  */
 /* { dg-skip-if "" { powerpc*-*-* } { "-mlongcall" } { "" } } */
 /* -msave-restore disables sibcall patterns.  */
diff --git a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-apply-2.c b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-apply-2.c
index d033010..669ab9a 100644
--- a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-apply-2.c
+++ b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-apply-2.c
@@ -9,7 +9,7 @@
 /* arm_hf_eabi: Variadic funcs use Base AAPCS.  Normal funcs use VFP variant.
    avr: Variadic funcs don't pass arguments in registers, while normal funcs
         do.  */
-/* { dg-skip-if "Variadic funcs use different argument passing from normal funcs" { arm_hf_eabi || { avr-*-* riscv*-*-* } } } */
+/* { dg-skip-if "Variadic funcs use different argument passing from normal funcs" { arm_hf_eabi || { avr-*-* riscv*-*-* amdgcn-*-* } } } */
 /* { dg-skip-if "Variadic funcs have all args on stack. Normal funcs have args in registers." { nds32*-*-* } } */
 /* { dg-require-effective-target untyped_assembly } */
    
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-11c.c b/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-11c.c
index 236d3a5..22ff44c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-11c.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-11c.c
@@ -39,4 +39,4 @@ int main ()
 }
 
 
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail amdgcn*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-1.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-1.c
index 1862750..f422f39 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/loop-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-1.c
@@ -45,8 +45,10 @@ int xxx(void)
    relaxation.  */
 /* CRIS keeps the address in a register.  */
 /* m68k sometimes puts the address in a register, depending on CPU and PIC.  */
+/* AMD GCN loads symbol addresses as hi/lo pairs, and then reuses that for
+   each jump.  */
 
-/* { dg-final { scan-assembler-times "foo" 5 { xfail hppa*-*-* ia64*-*-* sh*-*-* cris-*-* crisv32-*-* fido-*-* m68k-*-* i?86-*-mingw* i?86-*-cygwin* x86_64-*-mingw* visium-*-* nvptx*-*-* } } } */
+/* { dg-final { scan-assembler-times "foo" 5 { xfail hppa*-*-* ia64*-*-* sh*-*-* cris-*-* crisv32-*-* fido-*-* m68k-*-* i?86-*-mingw* i?86-*-cygwin* x86_64-*-mingw* visium-*-* nvptx*-*-* amdgcn*-*-* } } } */
 /* { dg-final { scan-assembler-times "foo,%r" 5 { target hppa*-*-* } } } */
 /* { dg-final { scan-assembler-times "= foo"  5 { target ia64*-*-* } } } */
 /* { dg-final { scan-assembler-times "call\[ \t\]*_foo" 5 { target i?86-*-mingw* i?86-*-cygwin* } } } */
@@ -56,3 +58,5 @@ int xxx(void)
 /* { dg-final { scan-assembler-times "\[jb\]sr" 5 { target fido-*-* m68k-*-* } } } */
 /* { dg-final { scan-assembler-times "bra *tr,r\[1-9\]*,r21" 5 { target visium-*-* } } } */
 /* { dg-final { scan-assembler-times "(?n)\[ \t\]call\[ \t\].*\[ \t\]foo," 5 { target nvptx*-*-* } } } */
+/* { dg-final { scan-assembler-times "add_u32\t\[sv\]\[0-9\]*, \[sv\]\[0-9\]*, foo@rel32@lo" 1 { target { amdgcn*-*-* } } } } */
+/* { dg-final { scan-assembler-times "s_swappc_b64" 5 { target { amdgcn*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
index 056d1c4..3975757 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -13,4 +13,4 @@ int foo()
 }
 
 /* Listed targets xfailed due to PR84958.  */
-/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { { alpha*-*-* nvptx*-*-* } || { sparc*-*-* && lp64 } } } } } */
+/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { { alpha*-*-* amdgcn*-*-* nvptx*-*-* } || { sparc*-*-* && lp64 } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/tree-vect.h b/gcc/testsuite/gcc.dg/vect/tree-vect.h
index 69c93ac..2ddfa5e 100644
--- a/gcc/testsuite/gcc.dg/vect/tree-vect.h
+++ b/gcc/testsuite/gcc.dg/vect/tree-vect.h
@@ -1,5 +1,9 @@
 /* Check if system supports SIMD */
+#ifdef __AMDGCN__
+#define signal(A,B)
+#else
 #include <signal.h>
+#endif
 
 #if defined(__i386__) || defined(__x86_64__)
 # include "cpuid.h"
diff --git a/gcc/testsuite/gfortran.dg/bind_c_array_params_2.f90 b/gcc/testsuite/gfortran.dg/bind_c_array_params_2.f90
index 25f5dda..34ed055 100644
--- a/gcc/testsuite/gfortran.dg/bind_c_array_params_2.f90
+++ b/gcc/testsuite/gfortran.dg/bind_c_array_params_2.f90
@@ -16,8 +16,9 @@ integer :: aa(4,4)
 call test(aa)
 end
 
-! { dg-final { scan-assembler-times "\[ \t\]\[$,_0-9\]*myBindC" 1 { target { ! { hppa*-*-* s390*-*-* *-*-cygwin* } } } } }
+! { dg-final { scan-assembler-times "\[ \t\]\[$,_0-9\]*myBindC" 1 { target { ! { hppa*-*-* s390*-*-* *-*-cygwin* amdgcn*-*-* } } } } }
 ! { dg-final { scan-assembler-times "myBindC,%r2" 1 { target { hppa*-*-* } } } }
 ! { dg-final { scan-assembler-times "call\tmyBindC" 1 { target { *-*-cygwin* } } } }
 ! { dg-final { scan-assembler-times "brasl\t%r\[0-9\]*,myBindC" 1 { target { s390*-*-* } } } }
+! { dg-final { scan-assembler-times "add_u32\t\[sv\]\[0-9\]*, \[sv\]\[0-9\]*, myBindC@rel32@lo" 1 { target { amdgcn*-*-* } } } }
 ! { dg-final { scan-tree-dump-times "test \\\(&parm\\." 1 "original" } }
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 1e627fa..bbb2e1f 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -662,6 +662,7 @@ proc check_profiling_available { test_what } {
 	# missing other needed machinery.
 	if {[istarget aarch64*-*-elf]
 	     || [istarget am3*-*-linux*]
+	     || [istarget amdgcn-*-*]
 	     || [istarget arm*-*-eabi*]
 	     || [istarget arm*-*-elf]
 	     || [istarget arm*-*-symbianelf*]
@@ -788,6 +789,9 @@ proc check_effective_target_global_constructor {} {
     if { [istarget nvptx-*-*] } {
 	return 0
     }
+    if { [istarget amdgcn-*-*] } {
+	return 0
+    }
     return 1
 }
 
@@ -808,6 +812,10 @@ proc check_effective_target_return_address {} {
     if { [istarget nvptx-*-*] } {
 	return 0
     }
+    # It could be supported on amdgcn, but isn't yet.
+    if { [istarget amdgcn*-*-*] } {
+	return 0
+    }
     return 1
 }
 
@@ -954,9 +962,10 @@ proc check_effective_target_fgraphite {} {
 # code, 0 otherwise.
 
 proc check_effective_target_fopenacc {} {
-    # nvptx can be built with the device-side bits of openacc, but it
+    # nvptx/amdgcn can be built with the device-side bits of openacc, but it
     # does not make sense to test it as an openacc host.
     if [istarget nvptx-*-*] { return 0 }
+    if [istarget amdgcn-*-*] { return 0 }
 
     return [check_no_compiler_messages fopenacc object {
 	void foo (void) { }
@@ -967,9 +976,10 @@ proc check_effective_target_fopenacc {} {
 # code, 0 otherwise.
 
 proc check_effective_target_fopenmp {} {
-    # nvptx can be built with the device-side bits of libgomp, but it
+    # nvptx/amdgcn can be built with the device-side bits of libgomp, but it
     # does not make sense to test it as an openmp host.
     if [istarget nvptx-*-*] { return 0 }
+    if [istarget amdgcn-*-*] { return 0 }
 
     return [check_no_compiler_messages fopenmp object {
 	void foo (void) { }
@@ -3107,6 +3117,7 @@ proc check_effective_target_vect_int { } {
 	if { [istarget i?86-*-*] || [istarget x86_64-*-*]
              || ([istarget powerpc*-*-*]
 		 && ![istarget powerpc-*-linux*paired*])
+	     || [istarget amdgcn-*-*]
 	     || [istarget spu-*-*]
 	     || [istarget sparc*-*-*]
 	     || [istarget alpha*-*-*]
@@ -3144,7 +3155,8 @@ proc check_effective_target_vect_intfloat_cvt { } {
 		 && ![istarget powerpc-*-linux*paired*])
 	     || [is-effective-target arm_neon]
 	     || ([istarget mips*-*-*]
-		 && [et-is-effective-target mips_msa]) } {
+		 && [et-is-effective-target mips_msa])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_intfloat_cvt_saved($et_index) 1
         }
     }
@@ -3248,7 +3260,8 @@ proc check_effective_target_vect_uintfloat_cvt { } {
 	     || [istarget aarch64*-*-*]
 	     || [is-effective-target arm_neon]
 	     || ([istarget mips*-*-*]
-		 && [et-is-effective-target mips_msa]) } {
+		 && [et-is-effective-target mips_msa])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_uintfloat_cvt_saved($et_index) 1
         }
     }
@@ -3276,7 +3289,8 @@ proc check_effective_target_vect_floatint_cvt { } {
 		 && ![istarget powerpc-*-linux*paired*])
 	     || [is-effective-target arm_neon]
 	     || ([istarget mips*-*-*]
-		 && [et-is-effective-target mips_msa]) } {
+		 && [et-is-effective-target mips_msa])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_floatint_cvt_saved($et_index) 1
         }
     }
@@ -3302,7 +3316,8 @@ proc check_effective_target_vect_floatuint_cvt { } {
 	      && ![istarget powerpc-*-linux*paired*])
 	    || [is-effective-target arm_neon]
 	    || ([istarget mips*-*-*]
-		&& [et-is-effective-target mips_msa]) } {
+		&& [et-is-effective-target mips_msa])
+	    || [istarget amdgcn-*-*] } {
 	   set et_vect_floatuint_cvt_saved($et_index) 1
         }
     }
@@ -3352,7 +3367,8 @@ proc check_effective_target_vect_simd_clones { } {
 	# specified arch will be chosen, but still we need to at least
 	# be able to assemble avx512f.
 	if { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	      && [check_effective_target_avx512f]) } {
+	      && [check_effective_target_avx512f])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_simd_clones_saved($et_index) 1
 	}
     }
@@ -5462,7 +5478,8 @@ proc check_effective_target_vect_shift { } {
 		 && ([et-is-effective-target mips_msa]
 		     || [et-is-effective-target mips_loongson]))
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_shift_saved($et_index) 1
 	}
     }
@@ -5482,7 +5499,8 @@ proc check_effective_target_whole_vector_shift { } {
 	 || ([istarget mips*-*-*]
 	     && [et-is-effective-target mips_loongson])
 	 || ([istarget s390*-*-*]
-	     && [check_effective_target_s390_vx]) } {
+	     && [check_effective_target_s390_vx])
+	 || [istarget amdgcn-*-*] } {
 	set answer 1
     } else {
 	set answer 0
@@ -5504,6 +5522,7 @@ proc check_effective_target_vect_bswap { } {
 	set et_vect_bswap_saved($et_index) 0
 	if { [istarget aarch64*-*-*]
              || [is-effective-target arm_neon]
+	     || [istarget amdgcn-*-*]
 	   } {
 	   set et_vect_bswap_saved($et_index) 1
 	}
@@ -5530,7 +5549,8 @@ proc check_effective_target_vect_shift_char { } {
 	     || ([istarget mips*-*-*]
 		 && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_shift_char_saved($et_index) 1
 	}
     }
@@ -5555,7 +5575,8 @@ proc check_effective_target_vect_long { } {
 	 || ([istarget mips*-*-*]
 	      && [et-is-effective-target mips_msa])
 	 || ([istarget s390*-*-*]
-	     && [check_effective_target_s390_vx]) } {
+	     && [check_effective_target_s390_vx])
+	 || [istarget amdgcn-*-*] } {
 	set answer 1
     } else {
 	set answer 0
@@ -5589,7 +5610,8 @@ proc check_effective_target_vect_float { } {
 		 && [et-is-effective-target mips_msa])
 	     || [is-effective-target arm_neon]
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vxe]) } {
+		 && [check_effective_target_s390_vxe])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_float_saved($et_index) 1
 	}
     }
@@ -5631,7 +5653,8 @@ proc check_effective_target_vect_double { } {
 	     || ([istarget mips*-*-*]
 		 && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_double_saved($et_index) 1
 	}
     }
@@ -5767,7 +5790,8 @@ proc check_effective_target_vect_perm { } {
 		 && ([et-is-effective-target mpaired_single]
 		     || [et-is-effective-target mips_msa]))
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_perm_saved($et_index) 1
         }
     }
@@ -5872,7 +5896,8 @@ proc check_effective_target_vect_perm_byte { } {
 	     || ([istarget mips-*.*]
 		 && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_perm_byte_saved($et_index) 1
         }
     }
@@ -5913,7 +5938,8 @@ proc check_effective_target_vect_perm_short { } {
 	     || ([istarget mips*-*-*]
 		  && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_perm_short_saved($et_index) 1
         }
     }
@@ -6084,7 +6110,8 @@ proc check_effective_target_vect_widen_mult_qi_to_hi { } {
 		  && ![check_effective_target_aarch64_sve])
               || [is-effective-target arm_neon]
 	      || ([istarget s390*-*-*]
-		  && [check_effective_target_s390_vx]) } {
+		  && [check_effective_target_s390_vx])
+	      || [istarget amdgcn-*-*] } {
 	    set et_vect_widen_mult_qi_to_hi_saved($et_index) 1
         }
     }
@@ -6124,7 +6151,8 @@ proc check_effective_target_vect_widen_mult_hi_to_si { } {
 	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
 	     || [is-effective-target arm_neon]
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_widen_mult_hi_to_si_saved($et_index) 1
         }
     }
@@ -6151,7 +6179,8 @@ proc check_effective_target_vect_widen_mult_qi_to_hi_pattern { } {
               || ([is-effective-target arm_neon]
 		  && [check_effective_target_arm_little_endian])
 	      || ([istarget s390*-*-*]
-		  && [check_effective_target_s390_vx]) } {
+		  && [check_effective_target_s390_vx])
+	      || [istarget amdgcn-*-*] } {
 	    set et_vect_widen_mult_qi_to_hi_pattern_saved($et_index) 1
         }
     }
@@ -6181,7 +6210,8 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
 	     || ([is-effective-target arm_neon]
 		 && [check_effective_target_arm_little_endian])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	    set et_vect_widen_mult_hi_to_si_pattern_saved($et_index) 1
         }
     }
@@ -6578,7 +6608,8 @@ proc check_effective_target_vect_natural_alignment { } {
     set et_vect_natural_alignment 1
     if { [check_effective_target_arm_eabi]
 	 || [istarget nvptx-*-*]
-	 || [istarget s390*-*-*] } {
+	 || [istarget s390*-*-*]
+	 || [istarget amdgcn-*-*] } {
 	set et_vect_natural_alignment 0
     }
     verbose "check_effective_target_vect_natural_alignment:\
@@ -6589,7 +6620,8 @@ proc check_effective_target_vect_natural_alignment { } {
 # Return true if fully-masked loops are supported.
 
 proc check_effective_target_vect_fully_masked { } {
-    return [check_effective_target_aarch64_sve]
+    return [expr { [check_effective_target_aarch64_sve]
+	           || [istarget amdgcn*-*-*] }]
 }
 
 # Return 1 if the target doesn't prefer any alignment beyond element
@@ -6648,7 +6680,8 @@ proc check_effective_target_vect_element_align { } {
 	set et_vect_element_align($et_index) 0
 	if { ([istarget arm*-*-*]
 	      && ![check_effective_target_arm_vect_no_misalign])
-	     || [check_effective_target_vect_hw_misalign] } {
+	     || [check_effective_target_vect_hw_misalign]
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_element_align($et_index) 1
 	}
     }
@@ -6690,13 +6723,15 @@ proc check_effective_target_vect_load_lanes { } {
 # Return 1 if the target supports vector masked stores.
 
 proc check_effective_target_vect_masked_store { } {
-    return [check_effective_target_aarch64_sve]
+    return [expr { [check_effective_target_aarch64_sve]
+		   || [istarget amdgcn*-*-*] }]
 }
 
 # Return 1 if the target supports vector scatter stores.
 
 proc check_effective_target_vect_scatter_store { } {
-    return [check_effective_target_aarch64_sve]
+    return [expr { [check_effective_target_aarch64_sve]
+		   || [istarget amdgcn*-*-*] }]
 }
 
 # Return 1 if the target supports vector conditional operations, 0 otherwise.
@@ -6719,7 +6754,8 @@ proc check_effective_target_vect_condition { } {
 	     || ([istarget arm*-*-*]
 		 && [check_effective_target_arm_neon_ok])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_cond_saved($et_index) 1
 	}
     }
@@ -6746,7 +6782,8 @@ proc check_effective_target_vect_cond_mixed { } {
 	     || ([istarget mips*-*-*]
 		 && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_cond_mixed_saved($et_index) 1
 	}
     }
@@ -6774,7 +6811,8 @@ proc check_effective_target_vect_char_mult { } {
 	     || ([istarget mips*-*-*]
 		 && [et-is-effective-target mips_msa])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_char_mult_saved($et_index) 1
 	}
     }
@@ -6804,7 +6842,8 @@ proc check_effective_target_vect_short_mult { } {
 		 && ([et-is-effective-target mips_msa]
 		     || [et-is-effective-target mips_loongson]))
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_short_mult_saved($et_index) 1
 	}
     }
@@ -6833,7 +6872,8 @@ proc check_effective_target_vect_int_mult { } {
 		 && [et-is-effective-target mips_msa])
 	     || [check_effective_target_arm32]
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_vect_int_mult_saved($et_index) 1
 	}
     }
@@ -6949,6 +6989,9 @@ foreach N {2 3 4 8} {
 		      || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
 		    set et_vect_stridedN_saved($et_index) 1
 		}
+		if [check_effective_target_vect_fully_masked] {
+		    set et_vect_stridedN_saved($et_index) 1
+		}
 	    }
 
 	    verbose "check_effective_target_vect_stridedN:\
@@ -7038,7 +7081,8 @@ proc check_effective_target_sqrt_insn { } {
 	     || [istarget aarch64*-*-*]
 	     || ([istarget arm*-*-*] && [check_effective_target_arm_vfp_ok])
 	     || ([istarget s390*-*-*]
-		 && [check_effective_target_s390_vx]) } {
+		 && [check_effective_target_s390_vx])
+	     || [istarget amdgcn-*-*] } {
 	   set et_sqrt_insn_saved 1
 	}
     }
@@ -7076,7 +7120,8 @@ proc check_effective_target_vect_call_sqrtf { } {
 proc check_effective_target_vect_call_lrint { } {
     set et_vect_call_lrint 0
     if { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	  && [check_effective_target_ilp32]) } {
+	  && [check_effective_target_ilp32])
+	 || [istarget amdgcn-*-*] } {
 	set et_vect_call_lrint 1
     }
 
@@ -7095,7 +7140,8 @@ proc check_effective_target_vect_call_btrunc { } {
 		 using cached result" 2
     } else {
 	set et_vect_call_btrunc_saved($et_index) 0
-	if { [istarget aarch64*-*-*] } {
+	if { [istarget aarch64*-*-*]
+	     || [istarget amdgcn-*-*] } {
 	  set et_vect_call_btrunc_saved($et_index) 1
 	}
     }
@@ -7116,7 +7162,8 @@ proc check_effective_target_vect_call_btruncf { } {
 		 using cached result" 2
     } else {
 	set et_vect_call_btruncf_saved($et_index) 0
-	if { [istarget aarch64*-*-*] } {
+	if { [istarget aarch64*-*-*]
+	     || [istarget amdgcn-*-*] } {
 	  set et_vect_call_btruncf_saved($et_index) 1
 	}
     }
@@ -7136,7 +7183,8 @@ proc check_effective_target_vect_call_ceil { } {
 	verbose "check_effective_target_vect_call_ceil: using cached result" 2
     } else {
 	set et_vect_call_ceil_saved($et_index) 0
-	if { [istarget aarch64*-*-*] } {
+	if { [istarget aarch64*-*-*]
+	     || [istarget amdgcn-*-*] } {
 	  set et_vect_call_ceil_saved($et_index) 1
 	}
     }
@@ -7196,7 +7244,8 @@ proc check_effective_target_vect_call_floorf { } {
 	verbose "check_effective_target_vect_call_floorf: using cached result" 2
     } else {
 	set et_vect_call_floorf_saved($et_index) 0
-	if { [istarget aarch64*-*-*] } {
+	if { [istarget aarch64*-*-*]
+	     || [istarget amdgcn-*-*] } {
 	  set et_vect_call_floorf_saved($et_index) 1
 	}
     }
@@ -8360,7 +8409,8 @@ proc check_effective_target_gld { } {
 # (LTO) support.
 
 proc check_effective_target_lto { } {
-    if { [istarget nvptx-*-*] } {
+    if { [istarget nvptx-*-*]
+	 || [istarget amdgcn-*-*] } {
 	return 0;
     }
     return [check_no_compiler_messages lto object {
@@ -8678,6 +8728,8 @@ proc check_vect_support_and_set_flags { } {
 	    lappend DEFAULT_VECTCFLAGS "-march=z14" "-mzarch"
             set dg-do-what-default compile
         }
+    } elseif [istarget amdgcn-*-*] {
+        set dg-do-what-default run
     } else {
         return 0
     }

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 01/25] Handle vectors that don't fit in an integer.
  2018-09-05 11:49 ` [PATCH 01/25] Handle vectors that don't fit in an integer ams
@ 2018-09-05 11:54   ` Jakub Jelinek
  2018-09-14 16:03   ` Richard Sandiford
  1 sibling, 0 replies; 187+ messages in thread
From: Jakub Jelinek @ 2018-09-05 11:54 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Wed, Sep 05, 2018 at 12:48:49PM +0100, ams@codesourcery.com wrote:
> +++ b/gcc/combine.c
> @@ -8621,7 +8621,13 @@ gen_lowpart_or_truncate (machine_mode mode, rtx x)
>      {
>        /* Bit-cast X into an integer mode.  */
>        if (!SCALAR_INT_MODE_P (GET_MODE (x)))
> -	x = gen_lowpart (int_mode_for_mode (GET_MODE (x)).require (), x);
> +	{
> +	  enum machine_mode imode =
> +	    int_mode_for_mode (GET_MODE (x)).require ();

Just a formatting nit, not a review - = should be on the next line.
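
That is, GNU style breaks the line before the operator, so the quoted
hunk would become:

	  enum machine_mode imode
	    = int_mode_for_mode (GET_MODE (x)).require ();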

	Jakub

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
@ 2018-09-05 11:56   ` Jakub Jelinek
  2018-09-05 12:05   ` Richard Biener
  2018-09-17  8:51   ` Richard Sandiford
  2 siblings, 0 replies; 187+ messages in thread
From: Jakub Jelinek @ 2018-09-05 11:56 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Wed, Sep 05, 2018 at 12:50:25PM +0100, ams@codesourcery.com wrote:
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 
> 	gcc/
> 	* simplify-rtx.c (convert_packed_vector): New function.
> 	(simplify_immed_subreg): Recognised Boolean vectors and call
> 	convert_packed_vector.
> ---

> +      int elem_bitsize = (GET_MODE_SIZE (from_mode).to_constant()

Further formatting nits, no space before (.

> +			  * BITS_PER_UNIT) / num_elem;
> +      int elem_mask = (1 << elem_bitsize) - 1;
> +      HOST_WIDE_INT subreg_mask =

= at the end of line.

> +	(sizeof (HOST_WIDE_INT) == GET_MODE_SIZE (to_mode)
> +	 ? -1
> +	 : (((HOST_WIDE_INT)1 << (GET_MODE_SIZE (to_mode) * BITS_PER_UNIT))
> +	    - 1));
> +  /* Vectors with multiple elements per byte are a special case.  */

> +  if ((VECTOR_MODE_P (innermode)
> +       && ((GET_MODE_NUNITS (innermode).to_constant()
> +	    / GET_MODE_SIZE(innermode).to_constant()) > 1))

Missing spaces before ( several times.
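
Applying those nits (and the usual space after a cast), the first hunk
would read:

      int elem_bitsize = (GET_MODE_SIZE (from_mode).to_constant ()
			  * BITS_PER_UNIT) / num_elem;
      int elem_mask = (1 << elem_bitsize) - 1;
      HOST_WIDE_INT subreg_mask
	= (sizeof (HOST_WIDE_INT) == GET_MODE_SIZE (to_mode)
	   ? -1
	   : (((HOST_WIDE_INT) 1 << (GET_MODE_SIZE (to_mode) * BITS_PER_UNIT))
	      - 1));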

	Jakub

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
  2018-09-05 11:56   ` Jakub Jelinek
@ 2018-09-05 12:05   ` Richard Biener
  2018-09-05 12:40     ` Andrew Stubbs
  2018-09-17  8:51   ` Richard Sandiford
  2 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-05 12:05 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches

On Wed, Sep 5, 2018 at 1:51 PM <ams@codesourcery.com> wrote:
>
>
> GCN uses V64BImode to represent vector masks in the middle-end, and DImode
> bit-masks to represent them in the back-end.  These must be converted at expand
> time and the most convenient way is to simply use a SUBREG.

x86 with AVX512 uses SImode in the middle-end as well via the get_mask_mode
vectorization target hook.  Maybe you can avoid another special-case
by piggy-backing on
that?
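
For reference, the hook Richard means is TARGET_VECTORIZE_GET_MASK_MODE.
A port opts in to integer mask modes by returning a scalar integer mode
from it.  A rough sketch only -- the signature is approximate and should
be checked against target.def for the tree in use:

  static opt_machine_mode
  gcn_vectorize_get_mask_mode (poly_uint64 nunits, poly_uint64 length)
  {
    /* One mask bit per vector lane; DImode is an assumption made for
       illustration, not what any port necessarily returns.  */
    return DImode;
  }

  #undef  TARGET_VECTORIZE_GET_MASK_MODE
  #define TARGET_VECTORIZE_GET_MASK_MODE gcn_vectorize_get_mask_mode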

> This works fine except that simplify_subreg needs to be able to convert
> immediates, mostly for REG_EQUAL and REG_EQUIV, and currently does not know how
> to convert vectors to integers where there is more than one element per byte.
>
> This patch implements such conversions for the cases that we need.
>
> I don't know why this is not a problem for other targets that use BImode
> vectors, such as ARM SVE, so it's possible I missed some magic somewhere?
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>
>         gcc/
>         * simplify-rtx.c (convert_packed_vector): New function.
>         (simplify_immed_subreg): Recognised Boolean vectors and call
>         convert_packed_vector.
> ---
>  gcc/simplify-rtx.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
@ 2018-09-05 12:21   ` Joseph Myers
  2018-09-11 22:42   ` Jeff Law
  2018-09-12 15:31   ` Richard Henderson
  2 siblings, 0 replies; 187+ messages in thread
From: Joseph Myers @ 2018-09-05 12:21 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Wed, 5 Sep 2018, ams@codesourcery.com wrote:

> This patch creates a new macro SPECIAL_REGNO_P which disables regrename.  In
> other words, the register is fixed once allocated.

Creating new target macros is generally suspect - the presumption is that 
target hooks should be used instead, unless it's clear the macro is part 
of a group of very closely related macros that should all become hooks at 
the same time (e.g. if adding a new one of the set of *_TYPE macros for 
standard typedefs, a macro would probably be appropriate rather than 
making just the new one into a hook).
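
For comparison, a hook version would follow the usual back-end pattern.
The hook name below is invented purely for illustration -- the patch as
posted only adds the macro:

  /* Hypothetical gcn.c replacement for the SPECIAL_REGNO_P macro.  */
  static bool
  gcn_special_regno_p (unsigned int regno ATTRIBUTE_UNUSED)
  {
    return false;  /* The real "fixed once allocated" test elided.  */
  }

  #undef  TARGET_SPECIAL_REGNO_P	/* Hypothetical hook.  */
  #define TARGET_SPECIAL_REGNO_P gcn_special_regno_p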

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 20/25] GCN libgcc.
  2018-09-05 11:52 ` [PATCH 20/25] GCN libgcc ams
@ 2018-09-05 12:32   ` Joseph Myers
  2018-11-09 18:49   ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Joseph Myers @ 2018-09-05 12:32 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Wed, 5 Sep 2018, ams@codesourcery.com wrote:

> diff --git a/libgcc/config/gcn/crt0.c b/libgcc/config/gcn/crt0.c
> new file mode 100644
> index 0000000..f4f367b
> --- /dev/null
> +++ b/libgcc/config/gcn/crt0.c
> @@ -0,0 +1,23 @@
> +/* Copyright (C) 2017 Free Software Foundation, Inc.

Copyright ranges on all new files should include 2018.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 12:05   ` Richard Biener
@ 2018-09-05 12:40     ` Andrew Stubbs
  2018-09-05 12:44       ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-05 12:40 UTC (permalink / raw)
  To: Richard Biener, Stubbs, Andrew; +Cc: GCC Patches

On 05/09/18 13:05, Richard Biener wrote:
> On Wed, Sep 5, 2018 at 1:51 PM <ams@codesourcery.com> wrote:
>>
>>
>> GCN uses V64BImode to represent vector masks in the middle-end, and DImode
>> bit-masks to represent them in the back-end.  These must be converted at expand
>> time and the most convenient way is to simply use a SUBREG.
> 
> x86 with AVX512 uses SImode in the middle-end as well via the get_mask_mode
> vectorization target hook.  Maybe you can avoid another special-case
> by piggy-backing on
> that?

That's exactly what I wanted to do, but I found that returning 
non-vector modes ran into trouble further down the road.  I don't recall 
the exact details now, but there were assertion failures and failures to 
vectorize.

That was in a GCC 8 codebase though, so is the AVX thing a recent change?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 12:40     ` Andrew Stubbs
@ 2018-09-05 12:44       ` Richard Biener
  2018-09-11 14:36         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-05 12:44 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: Stubbs, Andrew, GCC Patches

On Wed, Sep 5, 2018 at 2:40 PM Andrew Stubbs <andrew_stubbs@mentor.com> wrote:
>
> On 05/09/18 13:05, Richard Biener wrote:
> > On Wed, Sep 5, 2018 at 1:51 PM <ams@codesourcery.com> wrote:
> >>
> >>
> >> GCN uses V64BImode to represent vector masks in the middle-end, and DImode
> >> bit-masks to represent them in the back-end.  These must be converted at expand
> >> time and the most convenient way is to simply use a SUBREG.
> >
> > x86 with AVX512 uses SImode in the middle-end as well via the get_mask_mode
> > vectorization target hook.  Maybe you can avoid another special-case
> > by piggy-backing on
> > that?
>
> That's exactly what I wanted to do, but I found that returning
> non-vector modes ran into trouble further down the road.  I don't recall
> the exact details now, but there were assertion failures and failures to
> vectorize.
>
> That was in a GCC 8 codebase though, so is the AVX thing a recent change?

No.  You might want to look into the x86 backend if there's maybe more tweaks
needed when using non-vector mask modes.

Richard.

> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 05/25] Add sorry_at diagnostic function.
  2018-09-05 11:49 ` [PATCH 05/25] Add sorry_at diagnostic function ams
@ 2018-09-05 13:39   ` David Malcolm
  2018-09-05 13:41     ` David Malcolm
  0 siblings, 1 reply; 187+ messages in thread
From: David Malcolm @ 2018-09-05 13:39 UTC (permalink / raw)
  To: ams, gcc-patches

On Wed, 2018-09-05 at 12:49 +0100, ams@codesourcery.com wrote:
> The plain "sorry" diagnostic only gives the "current" location, which
> is
> typically the last line of the function or translation unit by time
> we get to
> the back end.
> 
> GCN uses "sorry" to report unsupported language features, such as
> static
> constructors, so it's useful to have a "sorry_at" variant.
> 
> This patch implements "sorry_at" according to the pattern of the
> other "at"
> variants.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 
> 	gcc/
> 	* diagnostic-core.h (sorry_at): New prototype.
> 	* diagnostic.c (sorry_at): New function.
> ---
>  gcc/diagnostic-core.h |  1 +
>  gcc/diagnostic.c      | 11 +++++++++++
>  2 files changed, 12 insertions(+)

OK, thanks.
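
For readers of the archive: a typical call site for the new function
would be something like

  sorry_at (DECL_SOURCE_LOCATION (fndecl), "static constructors not supported");

(illustrative only; the GCN back-end's actual messages may differ).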

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 21/25] GCN Back-end (part 1/2).
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (23 preceding siblings ...)
  2018-09-05 11:53 ` [PATCH 25/25] Port testsuite to GCN ams
@ 2018-09-05 13:40 ` Andrew Stubbs
  2018-11-09 19:11   ` Jeff Law
  2018-09-05 13:43 ` [PATCH 21/25] GCN Back-end (part 2/2) Andrew Stubbs
  25 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-05 13:40 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

This part initially failed to send due to size.

This is the main portion of the GCN back-end, plus the configuration
adjustments needed to build it.

The config.sub patch is here so people can try it, but I'm aware that it
needs to be committed elsewhere first.

The back-end contains various bits that support OpenACC and OpenMP, but the
middle-end and libgomp patches are missing.  I included them here because
they're harmless and carving up the files seems like unnecessary effort.
The remaining offload support will be posted at a later date.

The gcn-run.c is a separate tool that can run a GCN program on a GPU using
the ROCm drivers and HSA runtime libraries.
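
For the curious, invocation is along the lines of

  $ gcn-run ./a.out arg1 arg2

-- a sketch only; see gcn-run.c in the patch for the exact interface.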

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>
	    Jan Hubicka  <hubicka@ucw.cz>
	    Martin Jambor  <mjambor@suse.cz>

	* config.sub: Recognize amdgcn*-*-amdhsa.
	* configure.ac: Likewise.
	* configure: Regenerate.

	gcc/
	* common/config/gcn/gcn-common.c: New file.
	* config.gcc: Add amdgcn*-*-amdhsa configuration.
	* config/gcn/constraints.md: New file.
	* config/gcn/driver-gcn.c: New file.
	* config/gcn/gcn-builtins.def: New file.
	* config/gcn/gcn-hsa.h: New file.
	* config/gcn/gcn-modes.def: New file.
	* config/gcn/gcn-opts.h: New file.
	* config/gcn/gcn-passes.def: New file.
	* config/gcn/gcn-protos.h: New file.
	* config/gcn/gcn-run.c: New file.
	* config/gcn/gcn-tree.c: New file.
	* config/gcn/gcn-valu.md: New file.
	* config/gcn/gcn.c: New file.
	* config/gcn/gcn.h: New file.
	* config/gcn/gcn.md: New file.
	* config/gcn/gcn.opt: New file.
	* config/gcn/mkoffload.c: New file.
	* config/gcn/offload.h: New file.
	* config/gcn/predicates.md: New file.
	* config/gcn/t-gcn-hsa: New file.

[-- Attachment #2: 0021-gcn-port-pt1.patch --]
[-- Type: text/x-patch, Size: 202361 bytes --]

diff --git a/config.sub b/config.sub
index c95acc6..33115a5 100755
--- a/config.sub
+++ b/config.sub
@@ -572,6 +572,7 @@ case $basic_machine in
 	| alpha | alphaev[4-8] | alphaev56 | alphaev6[78] | alphapca5[67] \
 	| alpha64 | alpha64ev[4-8] | alpha64ev56 | alpha64ev6[78] | alpha64pca5[67] \
 	| am33_2.0 \
+	| amdgcn \
 	| arc | arceb \
 	| arm | arm[bl]e | arme[lb] | armv[2-8] | armv[3-8][lb] | armv6m | armv[78][arm] \
 	| avr | avr32 \
@@ -909,6 +910,9 @@ case $basic_machine in
 	fx2800)
 		basic_machine=i860-alliant
 		;;
+	amdgcn)
+		basic_machine=amdgcn-unknown
+		;;
 	genix)
 		basic_machine=ns32k-ns
 		;;
@@ -1524,6 +1528,8 @@ case $os in
 		;;
 	*-eabi)
 		;;
+	amdhsa)
+		;;
 	*)
 		echo Invalid configuration \`"$1"\': system \`"$os"\' not recognized 1>&2
 		exit 1
@@ -1548,6 +1554,9 @@ case $basic_machine in
 	spu-*)
 		os=elf
 		;;
+	amdgcn-*)
+		os=-amdhsa
+		;;
 	*-acorn)
 		os=riscix1.2
 		;;
diff --git a/configure b/configure
index dd9fbe4..fb311ce 100755
--- a/configure
+++ b/configure
@@ -3569,6 +3569,8 @@ case "${target}" in
     noconfigdirs="$noconfigdirs ld gas gdb gprof"
     noconfigdirs="$noconfigdirs sim target-rda"
     ;;
+  amdgcn*-*-*)
+    ;;
   arm-*-darwin*)
     noconfigdirs="$noconfigdirs ld gas gdb gprof"
     noconfigdirs="$noconfigdirs sim target-rda"
diff --git a/configure.ac b/configure.ac
index a0b0917..35acf25 100644
--- a/configure.ac
+++ b/configure.ac
@@ -903,6 +903,8 @@ case "${target}" in
     noconfigdirs="$noconfigdirs ld gas gdb gprof"
     noconfigdirs="$noconfigdirs sim target-rda"
     ;;
+  amdgcn*-*-*)
+    ;;
   arm-*-darwin*)
     noconfigdirs="$noconfigdirs ld gas gdb gprof"
     noconfigdirs="$noconfigdirs sim target-rda"
diff --git a/gcc/common/config/gcn/gcn-common.c b/gcc/common/config/gcn/gcn-common.c
new file mode 100644
index 0000000..275bfd5
--- /dev/null
+++ b/gcc/common/config/gcn/gcn-common.c
@@ -0,0 +1,38 @@
+/* Common hooks for GCN
+   Copyright (C) 2016-2017 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "common/common-target.h"
+#include "common/common-target-def.h"
+#include "opts.h"
+#include "flags.h"
+#include "params.h"
+
+/* Set default optimization options.  */
+static const struct default_options gcn_option_optimization_table[] =
+  {
+    { OPT_LEVELS_1_PLUS, OPT_fomit_frame_pointer, NULL, 1 },
+    { OPT_LEVELS_NONE, 0, NULL, 0 }
+  };
+
+#undef  TARGET_OPTION_OPTIMIZATION_TABLE
+#define TARGET_OPTION_OPTIMIZATION_TABLE gcn_option_optimization_table
+
+struct gcc_targetm_common targetm_common = TARGETM_COMMON_INITIALIZER;
diff --git a/gcc/config.gcc b/gcc/config.gcc
index f81cf76..d28bee5 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -312,6 +312,10 @@ alpha*-*-*)
 	cpu_type=alpha
 	extra_options="${extra_options} g.opt"
 	;;
+amdgcn*)
+	cpu_type=gcn
+	use_gcc_stdint=wrap
+	;;
 am33_2.0-*-linux*)
 	cpu_type=mn10300
 	;;
@@ -1376,6 +1380,19 @@ ft32-*-elf)
 	tm_file="dbxelf.h elfos.h newlib-stdint.h ${tm_file}"
 	tmake_file="${tmake_file} ft32/t-ft32"
 	;;
+amdgcn-*-amdhsa)
+	tm_file="dbxelf.h elfos.h gcn/gcn-hsa.h gcn/gcn.h newlib-stdint.h"
+	tmake_file="gcn/t-gcn-hsa"
+	native_system_header_dir=/include
+	extra_modes=gcn/gcn-modes.def
+	extra_objs="${extra_objs} gcn-tree.o"
+	extra_gcc_objs="driver-gcn.o"
+	extra_programs="${extra_programs} gcn-run\$(exeext)"
+	if test x$enable_as_accelerator = xyes; then
+		extra_programs="${extra_programs} mkoffload\$(exeext)"
+		tm_file="${tm_file} gcn/offload.h"
+	fi
+	;;
 moxie-*-elf)
 	gas=yes
 	gnu_ld=yes
@@ -4042,6 +4059,24 @@ case "${target}" in
 		esac
 		;;
 
+	amdgcn-*-*)
+		supported_defaults="arch tune"
+
+		for which in arch tune; do
+			eval "val=\$with_$which"
+			case ${val} in
+			"" | carrizo | fiji | gfx900 )
+				# OK
+				;;
+			*)
+				echo "Unknown cpu used in --with-$which=$val." 1>&2
+				exit 1
+				;;
+			esac
+		done
+		[ "x$with_arch" = x ] && with_arch=fiji
+		;;
+
 	hppa*-*-*)
 		supported_defaults="arch schedule"
 
diff --git a/gcc/config/gcn/constraints.md b/gcc/config/gcn/constraints.md
new file mode 100644
index 0000000..9ebeb97
--- /dev/null
+++ b/gcc/config/gcn/constraints.md
@@ -0,0 +1,139 @@
+;; Constraint definitions for GCN.
+;; Copyright (C) 2016-2017 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+(define_constraint "I"
+  "Inline integer constant"
+  (and (match_code "const_int")
+       (match_test "ival >= -16 && ival <= 64")))
+
+(define_constraint "J"
+  "Signed integer 16-bit inline constant"
+  (and (match_code "const_int")
+       (match_test "((unsigned HOST_WIDE_INT) ival + 0x8000) < 0x10000")))
+
+(define_constraint "Kf"
+  "Immeditate constant -1"
+  (and (match_code "const_int")
+       (match_test "ival == -1")))
+
+(define_constraint "L"
+  "Unsigned integer 15-bit constant"
+  (and (match_code "const_int")
+       (match_test "((unsigned HOST_WIDE_INT) ival) < 0x8000")))
+
+(define_constraint "A"
+  "Inline immediate parameter"
+  (and (match_code "const_int,const_double,const_vector")
+       (match_test "gcn_inline_constant_p (op)")))
+
+(define_constraint "B"
+  "Immediate 32-bit parameter"
+  (and (match_code "const_int,const_double,const_vector")
+	(match_test "gcn_constant_p (op)")))
+
+(define_constraint "C"
+  "Immediate 32-bit parameter zero-extended to 64-bits"
+  (and (match_code "const_int,const_double,const_vector")
+	(match_test "gcn_constant64_p (op)")))
+
+(define_constraint "DA"
+  "Splittable inline immediate 64-bit parameter"
+  (and (match_code "const_int,const_double,const_vector")
+       (match_test "gcn_inline_constant64_p (op)")))
+
+(define_constraint "DB"
+  "Splittable immediate 64-bit parameter"
+  (match_code "const_int,const_double,const_vector"))
+
+(define_constraint "U"
+  "unspecified value"
+  (match_code "unspec"))
+
+(define_constraint "Y"
+  "Symbol or label for relative calls"
+  (match_code "symbol_ref,label_ref"))
+
+(define_register_constraint "v" "VGPR_REGS"
+  "VGPR registers")
+
+(define_register_constraint "Sg" "SGPR_REGS"
+  "SGPR registers")
+
+(define_register_constraint "SD" "SGPR_DST_REGS"
+  "registers useable as a destination of scalar operation")
+
+(define_register_constraint "SS" "SGPR_SRC_REGS"
+  "registers useable as a source of scalar operation")
+
+(define_register_constraint "Sm" "SGPR_MEM_SRC_REGS"
+  "registers useable as a source of scalar memory operation")
+
+(define_register_constraint "Sv" "SGPR_VOP3A_SRC_REGS"
+  "registers useable as a source of VOP3A instruction")
+
+(define_register_constraint "ca" "ALL_CONDITIONAL_REGS"
+  "SCC VCCZ or EXECZ")
+
+(define_register_constraint "cs" "SCC_CONDITIONAL_REG"
+  "SCC")
+
+(define_register_constraint "cV" "VCC_CONDITIONAL_REG"
+  "VCC")
+
+(define_register_constraint "e" "EXEC_MASK_REG"
+  "EXEC")
+
+(define_special_memory_constraint "RB"
+  "Buffer memory address to scratch memory."
+  (and (match_code "mem")
+       (match_test "AS_SCRATCH_P (MEM_ADDR_SPACE (op))")))
+
+(define_special_memory_constraint "RF"
+  "Buffer memory address to flat memory."
+  (and (match_code "mem")
+       (match_test "AS_FLAT_P (MEM_ADDR_SPACE (op))
+		    && gcn_flat_address_p (XEXP (op, 0), mode)")))
+
+(define_special_memory_constraint "RS"
+  "Buffer memory address to scalar flat memory."
+  (and (match_code "mem")
+       (match_test "AS_SCALAR_FLAT_P (MEM_ADDR_SPACE (op))
+		    && gcn_scalar_flat_mem_p (op)")))
+
+(define_special_memory_constraint "RL"
+  "Buffer memory address to LDS memory."
+  (and (match_code "mem")
+       (match_test "AS_LDS_P (MEM_ADDR_SPACE (op))")))
+
+(define_special_memory_constraint "RG"
+  "Buffer memory address to GDS memory."
+  (and (match_code "mem")
+       (match_test "AS_GDS_P (MEM_ADDR_SPACE (op))")))
+
+(define_special_memory_constraint "RD"
+  "Buffer memory address to GDS or LDS memory."
+  (and (match_code "mem")
+       (ior (match_test "AS_GDS_P (MEM_ADDR_SPACE (op))")
+	    (match_test "AS_LDS_P (MEM_ADDR_SPACE (op))"))))
+
+(define_special_memory_constraint "RM"
+  "Memory address to global (main) memory."
+  (and (match_code "mem")
+       (match_test "AS_GLOBAL_P (MEM_ADDR_SPACE (op))
+		    && gcn_global_address_p (XEXP (op, 0))")))
diff --git a/gcc/config/gcn/driver-gcn.c b/gcc/config/gcn/driver-gcn.c
new file mode 100644
index 0000000..21e8c69
--- /dev/null
+++ b/gcc/config/gcn/driver-gcn.c
@@ -0,0 +1,32 @@
+/* Subroutines for the gcc driver.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+
+const char *
+last_arg_spec_function (int argc, const char **argv)
+{
+  if (argc == 0)
+    return NULL;
+
+  return argv[argc-1];
+}
diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
new file mode 100644
index 0000000..1cf66d2
--- /dev/null
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -0,0 +1,116 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The first argument to these macros is the return type of the builtin,
+   the rest are arguments of the builtin.  */
+#define _A1(a)	       {a, GCN_BTI_END_OF_PARAMS}
+#define _A2(a,b)       {a, b, GCN_BTI_END_OF_PARAMS}
+#define _A3(a,b,c)     {a, b, c, GCN_BTI_END_OF_PARAMS}
+#define _A4(a,b,c,d)   {a, b, c, d, GCN_BTI_END_OF_PARAMS}
+#define _A5(a,b,c,d,e) {a, b, c, d, e, GCN_BTI_END_OF_PARAMS}
+
+DEF_BUILTIN (FLAT_LOAD_INT32, 1 /*CODE_FOR_flat_load_v64si*/,
+	     "flat_load_int32", B_INSN,
+	     _A3 (GCN_BTI_V64SI, GCN_BTI_EXEC, GCN_BTI_V64SI),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (FLAT_LOAD_PTR_INT32, 2 /*CODE_FOR_flat_load_ptr_v64si */,
+	     "flat_load_ptr_int32", B_INSN,
+	     _A4 (GCN_BTI_V64SI, GCN_BTI_EXEC, GCN_BTI_SIPTR, GCN_BTI_V64SI),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (FLAT_STORE_PTR_INT32, 3 /*CODE_FOR_flat_store_ptr_v64si */,
+	     "flat_store_ptr_int32", B_INSN,
+	     _A5 (GCN_BTI_VOID, GCN_BTI_EXEC, GCN_BTI_SIPTR, GCN_BTI_V64SI,
+		  GCN_BTI_V64SI),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (FLAT_LOAD_PTR_FLOAT, 2 /*CODE_FOR_flat_load_ptr_v64sf */,
+	     "flat_load_ptr_float", B_INSN,
+	     _A4 (GCN_BTI_V64SF, GCN_BTI_EXEC, GCN_BTI_SFPTR, GCN_BTI_V64SI),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (FLAT_STORE_PTR_FLOAT, 3 /*CODE_FOR_flat_store_ptr_v64sf */,
+	     "flat_store_ptr_float", B_INSN,
+	     _A5 (GCN_BTI_VOID, GCN_BTI_EXEC, GCN_BTI_SFPTR, GCN_BTI_V64SI,
+		  GCN_BTI_V64SF),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (SQRTVF, 3 /*CODE_FOR_sqrtvf */,
+	     "sqrtvf", B_INSN,
+	     _A2 (GCN_BTI_V64SF, GCN_BTI_V64SF),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (SQRTF, 3 /*CODE_FOR_sqrtf */,
+	     "sqrtf", B_INSN,
+	     _A2 (GCN_BTI_SF, GCN_BTI_SF),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (CMP_SWAP, -1,
+	    "cmp_swap", B_INSN,
+	    _A4 (GCN_BTI_UINT, GCN_BTI_VOIDPTR, GCN_BTI_UINT, GCN_BTI_UINT),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (CMP_SWAPLL, -1,
+	    "cmp_swapll", B_INSN,
+	    _A4 (GCN_BTI_LLUINT,
+		 GCN_BTI_VOIDPTR, GCN_BTI_LLUINT, GCN_BTI_LLUINT),
+	    gcn_expand_builtin_1)
+
+/* DEF_BUILTIN_BINOP_INT_FP creates many variants of a builtin function for a
+   given operation.  The first argument gives the base of the identifier of a
+   particular builtin, the second is used to form the name of the pattern
+   used to expand it, and the third is used to create the user-visible
+   builtin identifier.  */
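+
+/* For example (illustrative only; the real expansion is supplied wherever
+   this file is included), DEF_BUILTIN_BINOP_INT_FP (ADD, add, "add") creates
+   integer and floating-point vector variants whose expander pattern names
+   and user-visible builtin names are both formed from "add".  */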
+
+DEF_BUILTIN_BINOP_INT_FP (ADD, add, "add")
+DEF_BUILTIN_BINOP_INT_FP (SUB, sub, "sub")
+
+DEF_BUILTIN_BINOP_INT_FP (AND, and, "and")
+DEF_BUILTIN_BINOP_INT_FP (IOR, ior, "or")
+DEF_BUILTIN_BINOP_INT_FP (XOR, xor, "xor")
+
+/* OpenMP.  */
+
+DEF_BUILTIN (OMP_DIM_SIZE, CODE_FOR_oacc_dim_size,
+	     "dim_size", B_INSN,
+	     _A2 (GCN_BTI_INT, GCN_BTI_INT),
+	     gcn_expand_builtin_1)
+DEF_BUILTIN (OMP_DIM_POS, CODE_FOR_oacc_dim_pos,
+	     "dim_pos", B_INSN,
+	     _A2 (GCN_BTI_INT, GCN_BTI_INT),
+	     gcn_expand_builtin_1)
+
+/* OpenACC.  */
+
+DEF_BUILTIN (ACC_SINGLE_START, -1, "single_start", B_INSN, _A1 (GCN_BTI_BOOL),
+	     gcn_expand_builtin_1)
+
+DEF_BUILTIN (ACC_SINGLE_COPY_START, -1, "single_copy_start", B_INSN,
+	     _A1 (GCN_BTI_LDS_VOIDPTR), gcn_expand_builtin_1)
+
+DEF_BUILTIN (ACC_SINGLE_COPY_END, -1, "single_copy_end", B_INSN,
+	     _A2 (GCN_BTI_VOID, GCN_BTI_LDS_VOIDPTR), gcn_expand_builtin_1)
+
+DEF_BUILTIN (ACC_BARRIER, -1, "acc_barrier", B_INSN, _A1 (GCN_BTI_VOID),
+	     gcn_expand_builtin_1)
+
+
+#undef _A1
+#undef _A2
+#undef _A3
+#undef _A4
+#undef _A5
diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
new file mode 100644
index 0000000..182062d
--- /dev/null
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -0,0 +1,129 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef OBJECT_FORMAT_ELF
+ #error elf.h included before elfos.h
+#endif
+
+#define TEXT_SECTION_ASM_OP "\t.section\t.text"
+#define BSS_SECTION_ASM_OP  "\t.section\t.bss"
+#define GLOBAL_ASM_OP       "\t.globl\t"
+#define DATA_SECTION_ASM_OP "\t.data\t"
+#define SET_ASM_OP          "\t.set\t"
+#define LOCAL_LABEL_PREFIX  "."
+#define USER_LABEL_PREFIX   ""
+#define ASM_COMMENT_START   ";"
+#define TARGET_ASM_NAMED_SECTION default_elf_asm_named_section
+
+#define ASM_OUTPUT_ALIGNED_BSS(FILE, DECL, NAME, SIZE, ALIGN) \
+	    asm_output_aligned_bss (FILE, DECL, NAME, SIZE, ALIGN)
+
+#undef ASM_DECLARE_FUNCTION_NAME
+#define ASM_DECLARE_FUNCTION_NAME(FILE, NAME, DECL) \
+  gcn_hsa_declare_function_name ((FILE), (NAME), (DECL))
+
+#undef ASM_OUTPUT_ALIGNED_COMMON
+#define ASM_OUTPUT_ALIGNED_COMMON(FILE, NAME, SIZE, ALIGNMENT)	  \
+ (fprintf ((FILE), "%s", COMMON_ASM_OP),			  \
+  assemble_name ((FILE), (NAME)),				  \
+  fprintf ((FILE), "," HOST_WIDE_INT_PRINT_UNSIGNED ",%u\n",	  \
+	   (SIZE) > 0 ? (SIZE) : 1, (ALIGNMENT) / BITS_PER_UNIT))
+
+#define ASM_OUTPUT_LABEL(FILE,NAME) \
+  do { assemble_name (FILE, NAME); fputs (":\n", FILE); } while (0)
+
+#define ASM_OUTPUT_LABELREF(FILE, NAME) \
+  asm_fprintf (FILE, "%U%s", default_strip_name_encoding (NAME))
+
+extern unsigned int gcn_local_sym_hash (const char *name);
+
+/* The HSA runtime puts all global and local symbols into a single per-kernel
+   variable map.  In cases where we have two local static symbols with the same
+   name in different compilation units, this causes multiple definition errors.
+   To avoid this, we add a decoration to local symbol names based on a hash of
+   a "module ID" passed to the compiler via the -mlocal-symbol-id option.  This
+   is far from perfect, but we expect static local variables to be rare in
+   offload code.  */
+
+#define ASM_FORMAT_PRIVATE_NAME(OUTVAR, NAME, NUMBER)		\
+  do {								\
+    (OUTVAR) = (char *) alloca (strlen (NAME) + 30);		\
+    if (local_symbol_id && *local_symbol_id)			\
+      sprintf ((OUTVAR), "%s.%u.%.8x", (NAME), (NUMBER),	\
+	       gcn_local_sym_hash (local_symbol_id));		\
+    else							\
+      sprintf ((OUTVAR), "%s.%u", (NAME), (NUMBER));		\
+  } while (0)
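+
+/* For instance (hypothetical values), a static local "counter" with NUMBER 0
+   and a module-ID hash of 0x1a2b3c4d is emitted as "counter.0.1a2b3c4d";
+   without -mlocal-symbol-id it is simply "counter.0".  */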
+
+#define ASM_OUTPUT_SYMBOL_REF(FILE, X) gcn_asm_output_symbol_ref (FILE, X)
+
+#define ASM_OUTPUT_ADDR_DIFF_ELT(FILE, BODY, VALUE, REL) \
+  fprintf (FILE, "\t.word .L%d-.L%d\n", VALUE, REL)
+
+#define ASM_OUTPUT_ADDR_VEC_ELT(FILE, VALUE) \
+  fprintf (FILE, "\t.word .L%d\n", VALUE)
+
+#define ASM_OUTPUT_ALIGN(FILE,LOG) \
+  do { if (LOG!=0) fprintf (FILE, "\t.align\t%d\n", 1<<(LOG)); } while (0)
+#define ASM_OUTPUT_ALIGN_WITH_NOP(FILE,LOG)	       \
+  do {						       \
+    if (LOG!=0)					       \
+      fprintf (FILE, "\t.p2alignl\t%d, 0xBF800000"     \
+	       " ; Fill value is 's_nop 0'\n", (LOG)); \
+  } while (0)
+
+#define ASM_APP_ON  ""
+#define ASM_APP_OFF ""
+
+/* Avoid the default in ../../gcc.c, which adds "-pthread"; that option is
+   not supported for gcn.  */
+#define GOMP_SELF_SPECS ""
+
+/* Use LLVM assembler and linker options.  */
+#define ASM_SPEC  "-triple=amdgcn--amdhsa "	     \
+		  "%:last_arg(%{march=*:-mcpu=%*}) " \
+		  "-filetype=obj"
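+/* As a sketch: a driver invocation carrying "-march=fiji -march=vega" ends
+   up passing a single "-mcpu=vega" to the assembler, because the last_arg
+   spec function (see below) keeps only the final option.  */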
+/* Add -mlocal-symbol-id=<source-file-basename> unless the user (or mkoffload)
+   passes the option explicitly on the command line.  The option also causes
+   several dump-matching tests to fail in the testsuite, so the option is not
+   added when the tree dump or compare-debug options used in the testsuite
+   are present.
+   This has the potential for surprise, but a user can still use an explicit
+   -mlocal-symbol-id=<whatever> option manually together with -fdump-tree or
+   -fcompare-debug options.  */
+#define CC1_SPEC "%{!mlocal-symbol-id=*:%{!fdump-tree-*:"	\
+		 "%{!fdump-ipa-*:%{!fcompare-debug*:-mlocal-symbol-id=%b}}}}"
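+/* For example, compiling foo.c implicitly passes -mlocal-symbol-id=foo to
+   cc1, since %b expands to the basename of the input file.  */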
+#define LINK_SPEC "--pie"
+#define LIB_SPEC  "-lc"
+
+/* Provides a _start symbol to keep the linker happy.  */
+#define STARTFILE_SPEC "crt0.o%s"
+#define ENDFILE_SPEC   ""
+#define STANDARD_STARTFILE_PREFIX_2 ""
+
+/* The LLVM assembler rejects multiple -mcpu options, so we must drop
+   all but the last.  */
+extern const char *last_arg_spec_function (int argc, const char **argv);
+#define EXTRA_SPEC_FUNCTIONS	\
+    { "last_arg", last_arg_spec_function },
+
+#undef LOCAL_INCLUDE_DIR
+
+/* FIXME: review debug info settings */
+#define PREFERRED_DEBUGGING_TYPE   DWARF2_DEBUG
+#define DWARF2_DEBUGGING_INFO      1
+#define DWARF2_ASM_LINE_DEBUG_INFO 1
+#define EH_FRAME_THROUGH_COLLECT2  1
diff --git a/gcc/config/gcn/gcn-modes.def b/gcc/config/gcn/gcn-modes.def
new file mode 100644
index 0000000..6f273b0
--- /dev/null
+++ b/gcc/config/gcn/gcn-modes.def
@@ -0,0 +1,45 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Half-precision floating point */
+FLOAT_MODE (HF, 2, 0);
+/* FIXME: No idea what format it is.  */
+ADJUST_FLOAT_FORMAT (HF, &ieee_half_format);
+
+/* Mask mode.  Used for the autovectorizer only, and converted to DImode
+   during the expand pass.  */
+VECTOR_BOOL_MODE (V64BI, 64, 8); /*		  V64BI */
+
+/* Native vector modes.  */
+VECTOR_MODE (INT, QI, 64);      /*		  V64QI */
+VECTOR_MODE (INT, HI, 64);      /*		  V64HI */
+VECTOR_MODE (INT, SI, 64);      /*		  V64SI */
+VECTOR_MODE (INT, DI, 64);      /*		  V64DI */
+VECTOR_MODE (INT, TI, 64);      /*		  V64TI */
+VECTOR_MODE (FLOAT, HF, 64);    /*		  V64HF */
+VECTOR_MODE (FLOAT, SF, 64);    /*		  V64SF */
+VECTOR_MODE (FLOAT, DF, 64);    /*		  V64DF */
+
+/* Vector units handle reads independently, so no larger alignment is
+   needed.  */
+ADJUST_ALIGNMENT (V64QI, 1);
+ADJUST_ALIGNMENT (V64HI, 2);
+ADJUST_ALIGNMENT (V64SI, 4);
+ADJUST_ALIGNMENT (V64DI, 8);
+ADJUST_ALIGNMENT (V64TI, 16);
+ADJUST_ALIGNMENT (V64HF, 2);
+ADJUST_ALIGNMENT (V64SF, 4);
+ADJUST_ALIGNMENT (V64DF, 8);
diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
new file mode 100644
index 0000000..368e0b5
--- /dev/null
+++ b/gcc/config/gcn/gcn-opts.h
@@ -0,0 +1,36 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCN_OPTS_H
+#define GCN_OPTS_H
+
+/* Which processor to generate code or schedule for.  */
+enum processor_type
+{
+  PROCESSOR_CARRIZO,
+  PROCESSOR_FIJI,
+  PROCESSOR_VEGA
+};
+
+/* Set in gcn_option_override.  */
+extern int gcn_isa;
+
+#define TARGET_GCN3 (gcn_isa == 3)
+#define TARGET_GCN3_PLUS (gcn_isa >= 3)
+#define TARGET_GCN5 (gcn_isa == 5)
+#define TARGET_GCN5_PLUS (gcn_isa >= 5)
+
+#endif
diff --git a/gcc/config/gcn/gcn-passes.def b/gcc/config/gcn/gcn-passes.def
new file mode 100644
index 0000000..a1e1d73
--- /dev/null
+++ b/gcc/config/gcn/gcn-passes.def
@@ -0,0 +1,19 @@
+/* Copyright (C) 2017-2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+   
+   GCC is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3, or (at your option) any later
+   version.
+   
+   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+   
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+INSERT_PASS_AFTER (pass_omp_target_link, 1, pass_omp_gcn);
diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
new file mode 100644
index 0000000..16ec3ed
--- /dev/null
+++ b/gcc/config/gcn/gcn-protos.h
@@ -0,0 +1,144 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GCN_PROTOS_
+#define _GCN_PROTOS_
+
+extern void gcn_asm_output_symbol_ref (FILE *file, rtx x);
+extern tree gcn_builtin_decl (unsigned code, bool initialize_p);
+extern bool gcn_can_split_p (machine_mode, rtx);
+extern bool gcn_constant64_p (rtx);
+extern bool gcn_constant_p (rtx);
+extern rtx gcn_convert_mask_mode (rtx reg);
+extern char * gcn_expand_dpp_shr_insn (machine_mode, const char *, int, int);
+extern void gcn_expand_epilogue ();
+extern void gcn_expand_prologue ();
+extern rtx gcn_expand_reduc_scalar (machine_mode, rtx, int);
+extern rtx gcn_expand_scalar_to_vector_address (machine_mode, rtx, rtx, rtx);
+extern void gcn_expand_vector_init (rtx, rtx);
+extern bool gcn_flat_address_p (rtx, machine_mode);
+extern bool gcn_fp_constant_p (rtx, bool);
+extern rtx gcn_full_exec ();
+extern rtx gcn_full_exec_reg ();
+extern rtx gcn_gen_undef (machine_mode);
+extern bool gcn_global_address_p (rtx);
+extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+						 const char *name);
+extern void gcn_goacc_adjust_gangprivate_decl (tree var);
+extern void gcn_goacc_reduction (gcall *call);
+extern bool gcn_hard_regno_rename_ok (unsigned int from_reg,
+				      unsigned int to_reg);
+extern machine_mode gcn_hard_regno_caller_save_mode (unsigned int regno,
+						     unsigned int nregs,
+						     machine_mode regmode);
+extern bool gcn_hard_regno_mode_ok (int regno, machine_mode mode);
+extern int gcn_hard_regno_nregs (int regno, machine_mode mode);
+extern void gcn_hsa_declare_function_name (FILE *file, const char *name,
+					   tree decl);
+extern HOST_WIDE_INT gcn_initial_elimination_offset (int, int);
+extern bool gcn_inline_constant64_p (rtx);
+extern bool gcn_inline_constant_p (rtx);
+extern int gcn_inline_fp_constant_p (rtx, bool);
+extern reg_class gcn_mode_code_base_reg_class (machine_mode, addr_space_t,
+					       int, int);
+extern rtx gcn_oacc_dim_pos (int dim);
+extern rtx gcn_oacc_dim_size (int dim);
+extern rtx gcn_operand_doublepart (machine_mode, rtx, int);
+extern rtx gcn_operand_part (machine_mode, rtx, int);
+extern bool gcn_regno_mode_code_ok_for_base_p (int, machine_mode,
+					       addr_space_t, int, int);
+extern reg_class gcn_regno_reg_class (int regno);
+extern rtx gcn_scalar_exec ();
+extern rtx gcn_scalar_exec_reg ();
+extern bool gcn_scalar_flat_address_p (rtx);
+extern bool gcn_scalar_flat_mem_p (rtx);
+extern bool gcn_sgpr_move_p (rtx, rtx);
+extern bool gcn_valid_move_p (machine_mode, rtx, rtx);
+extern rtx gcn_vec_constant (machine_mode, int);
+extern rtx gcn_vec_constant (machine_mode, rtx);
+extern bool gcn_vgpr_move_p (rtx, rtx);
+extern void print_operand_address (FILE *file, register rtx addr);
+extern void print_operand (FILE *file, rtx x, int code);
+extern bool regno_ok_for_index_p (int);
+
+enum gcn_cvt_t
+{
+  fix_trunc_cvt,
+  fixuns_trunc_cvt,
+  float_cvt,
+  floatuns_cvt,
+  extend_cvt,
+  trunc_cvt
+};
+
+extern bool gcn_valid_cvt_p (machine_mode from, machine_mode to,
+			     enum gcn_cvt_t op);
+
+#ifdef TREE_CODE
+extern void gcn_init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree,
+				      int);
+class gimple_opt_pass;
+extern gimple_opt_pass *make_pass_omp_gcn (gcc::context *ctxt);
+#endif
+
+/* Return true if MODE is valid for 1 VGPR register.  */
+
+inline bool
+vgpr_1reg_mode_p (machine_mode mode)
+{
+  return (mode == SImode || mode == SFmode || mode == HImode || mode == QImode
+	  || mode == V64QImode || mode == V64HImode || mode == V64SImode
+	  || mode == V64HFmode || mode == V64SFmode || mode == BImode);
+}
+
+/* Return true if MODE is valid for 1 SGPR register.  */
+
+inline bool
+sgpr_1reg_mode_p (machine_mode mode)
+{
+  return (mode == SImode || mode == SFmode || mode == HImode
+	  || mode == QImode || mode == BImode);
+}
+
+/* Return true if MODE is valid for pair of VGPR registers.  */
+
+inline bool
+vgpr_2reg_mode_p (machine_mode mode)
+{
+  return (mode == DImode || mode == DFmode
+	  || mode == V64DImode || mode == V64DFmode);
+}
+
+/* Return true if MODE can be handled directly by VGPR operations.  */
+
+inline bool
+vgpr_vector_mode_p (machine_mode mode)
+{
+  return (mode == V64QImode || mode == V64HImode
+	  || mode == V64SImode || mode == V64DImode
+	  || mode == V64HFmode || mode == V64SFmode || mode == V64DFmode);
+}
+
+
+/* Return true if MODE is valid for pair of SGPR registers.  */
+
+inline bool
+sgpr_2reg_mode_p (machine_mode mode)
+{
+  return mode == DImode || mode == DFmode || mode == V64BImode;
+}
+
+#endif
diff --git a/gcc/config/gcn/gcn-run.c b/gcc/config/gcn/gcn-run.c
new file mode 100644
index 0000000..3dea343
--- /dev/null
+++ b/gcc/config/gcn/gcn-run.c
@@ -0,0 +1,854 @@
+/* Run a stand-alone AMD GCN kernel.
+
+   Copyright 2017 Mentor Graphics Corporation
+   Copyright 2018 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+/* This program will run a compiled stand-alone GCN kernel on a GPU.
+
+   The kernel entry point's signature must use a standard main signature:
+
+     int main(int argc, char **argv)
+*/
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <malloc.h>
+#include <stdio.h>
+#include <string.h>
+#include <dlfcn.h>
+#include <unistd.h>
+#include <elf.h>
+#include <signal.h>
+
+/* These probably won't be in elf.h for a while.  */
+#ifndef R_AMDGPU_NONE
+#define R_AMDGPU_NONE		0
+#define R_AMDGPU_ABS32_LO	1	/* (S + A) & 0xFFFFFFFF  */
+#define R_AMDGPU_ABS32_HI	2	/* (S + A) >> 32  */
+#define R_AMDGPU_ABS64		3	/* S + A  */
+#define R_AMDGPU_REL32		4	/* S + A - P  */
+#define R_AMDGPU_REL64		5	/* S + A - P  */
+#define R_AMDGPU_ABS32		6	/* S + A  */
+#define R_AMDGPU_GOTPCREL	7	/* G + GOT + A - P  */
+#define R_AMDGPU_GOTPCREL32_LO	8	/* (G + GOT + A - P) & 0xFFFFFFFF  */
+#define R_AMDGPU_GOTPCREL32_HI	9	/* (G + GOT + A - P) >> 32  */
+#define R_AMDGPU_REL32_LO	10	/* (S + A - P) & 0xFFFFFFFF  */
+#define R_AMDGPU_REL32_HI	11	/* (S + A - P) >> 32  */
+#define reserved		12
+#define R_AMDGPU_RELATIVE64	13	/* B + A  */
+#endif
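+
+/* In the formulas above, S is the symbol's value, A the relocation addend,
+   P the address of the field being relocated, B the load (base) offset, and
+   G/GOT the GOT entry offset and GOT base, following the usual ELF
+   conventions.  */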
+
+#include "hsa.h"
+
+#ifndef HSA_RUNTIME_LIB
+#define HSA_RUNTIME_LIB "libhsa-runtime64.so"
+#endif
+
+#ifndef VERSION_STRING
+#define VERSION_STRING "(version unknown)"
+#endif
+
+bool debug = false;
+
+hsa_agent_t device = { 0 };
+hsa_queue_t *queue = NULL;
+uint64_t kernel = 0;
+hsa_executable_t executable = { 0 };
+
+hsa_region_t kernargs_region = { 0 };
+uint32_t kernarg_segment_size = 0;
+uint32_t group_segment_size = 0;
+uint32_t private_segment_size = 0;
+
+static void
+usage (const char *progname)
+{
+  printf ("Usage: %s [options] kernel [kernel-args]\n\n"
+	  "Options:\n"
+	  "  --help\n"
+	  "  --version\n"
+	  "  --debug\n", progname);
+}
+
+static void
+version (const char *progname)
+{
+  printf ("%s " VERSION_STRING "\n", progname);
+}
+
+/* As the HSA runtime is dlopened, the following structure defines the
+   necessary function pointers.
+   Code adapted from libgomp.  */
+
+struct hsa_runtime_fn_info
+{
+  /* HSA runtime.  */
+  hsa_status_t (*hsa_status_string_fn) (hsa_status_t status,
+					const char **status_string);
+  hsa_status_t (*hsa_agent_get_info_fn) (hsa_agent_t agent,
+					 hsa_agent_info_t attribute,
+					 void *value);
+  hsa_status_t (*hsa_init_fn) (void);
+  hsa_status_t (*hsa_iterate_agents_fn)
+    (hsa_status_t (*callback) (hsa_agent_t agent, void *data), void *data);
+  hsa_status_t (*hsa_region_get_info_fn) (hsa_region_t region,
+					  hsa_region_info_t attribute,
+					  void *value);
+  hsa_status_t (*hsa_queue_create_fn)
+    (hsa_agent_t agent, uint32_t size, hsa_queue_type_t type,
+     void (*callback) (hsa_status_t status, hsa_queue_t *source, void *data),
+     void *data, uint32_t private_segment_size,
+     uint32_t group_segment_size, hsa_queue_t **queue);
+  hsa_status_t (*hsa_agent_iterate_regions_fn)
+    (hsa_agent_t agent,
+     hsa_status_t (*callback) (hsa_region_t region, void *data), void *data);
+  hsa_status_t (*hsa_executable_destroy_fn) (hsa_executable_t executable);
+  hsa_status_t (*hsa_executable_create_fn)
+    (hsa_profile_t profile, hsa_executable_state_t executable_state,
+     const char *options, hsa_executable_t *executable);
+  hsa_status_t (*hsa_executable_global_variable_define_fn)
+    (hsa_executable_t executable, const char *variable_name, void *address);
+  hsa_status_t (*hsa_executable_load_code_object_fn)
+    (hsa_executable_t executable, hsa_agent_t agent,
+     hsa_code_object_t code_object, const char *options);
+  hsa_status_t (*hsa_executable_freeze_fn) (hsa_executable_t executable,
+					    const char *options);
+  hsa_status_t (*hsa_signal_create_fn) (hsa_signal_value_t initial_value,
+					uint32_t num_consumers,
+					const hsa_agent_t *consumers,
+					hsa_signal_t *signal);
+  hsa_status_t (*hsa_memory_allocate_fn) (hsa_region_t region, size_t size,
+					  void **ptr);
+  hsa_status_t (*hsa_memory_copy_fn) (void *dst, const void *src,
+				      size_t size);
+  hsa_status_t (*hsa_memory_free_fn) (void *ptr);
+  hsa_status_t (*hsa_signal_destroy_fn) (hsa_signal_t signal);
+  hsa_status_t (*hsa_executable_get_symbol_fn)
+    (hsa_executable_t executable, const char *module_name,
+     const char *symbol_name, hsa_agent_t agent, int32_t call_convention,
+     hsa_executable_symbol_t *symbol);
+  hsa_status_t (*hsa_executable_symbol_get_info_fn)
+    (hsa_executable_symbol_t executable_symbol,
+     hsa_executable_symbol_info_t attribute, void *value);
+  void (*hsa_signal_store_relaxed_fn) (hsa_signal_t signal,
+				       hsa_signal_value_t value);
+  hsa_signal_value_t (*hsa_signal_wait_acquire_fn)
+    (hsa_signal_t signal, hsa_signal_condition_t condition,
+     hsa_signal_value_t compare_value, uint64_t timeout_hint,
+     hsa_wait_state_t wait_state_hint);
+  hsa_signal_value_t (*hsa_signal_wait_relaxed_fn)
+    (hsa_signal_t signal, hsa_signal_condition_t condition,
+     hsa_signal_value_t compare_value, uint64_t timeout_hint,
+     hsa_wait_state_t wait_state_hint);
+  hsa_status_t (*hsa_queue_destroy_fn) (hsa_queue_t *queue);
+  hsa_status_t (*hsa_code_object_deserialize_fn)
+    (void *serialized_code_object, size_t serialized_code_object_size,
+     const char *options, hsa_code_object_t *code_object);
+  uint64_t (*hsa_queue_load_write_index_relaxed_fn)
+    (const hsa_queue_t *queue);
+  void (*hsa_queue_store_write_index_relaxed_fn)
+    (const hsa_queue_t *queue, uint64_t value);
+  hsa_status_t (*hsa_shut_down_fn) ();
+};
+
+/* HSA runtime functions that are initialized in init_hsa_runtime_functions.
+   Code adapted from libgomp.  */
+
+static struct hsa_runtime_fn_info hsa_fns;
+
+#define DLSYM_FN(function)			     \
+  hsa_fns.function##_fn = dlsym (handle, #function); \
+  if (hsa_fns.function##_fn == NULL)		     \
+    goto fail;
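+
+/* For example, DLSYM_FN (hsa_init) expands to:
+
+     hsa_fns.hsa_init_fn = dlsym (handle, "hsa_init");
+     if (hsa_fns.hsa_init_fn == NULL)
+       goto fail;  */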
+
+static void
+init_hsa_runtime_functions (void)
+{
+  void *handle = dlopen (HSA_RUNTIME_LIB, RTLD_LAZY);
+  if (handle == NULL)
+    {
+      fprintf (stderr,
+	       "The HSA runtime is required to run GCN kernels on hardware.\n"
+	       "%s: File not found or could not be opened\n",
+	       HSA_RUNTIME_LIB);
+      exit (1);
+    }
+
+  DLSYM_FN (hsa_status_string)
+  DLSYM_FN (hsa_agent_get_info)
+  DLSYM_FN (hsa_init)
+  DLSYM_FN (hsa_iterate_agents)
+  DLSYM_FN (hsa_region_get_info)
+  DLSYM_FN (hsa_queue_create)
+  DLSYM_FN (hsa_agent_iterate_regions)
+  DLSYM_FN (hsa_executable_destroy)
+  DLSYM_FN (hsa_executable_create)
+  DLSYM_FN (hsa_executable_global_variable_define)
+  DLSYM_FN (hsa_executable_load_code_object)
+  DLSYM_FN (hsa_executable_freeze)
+  DLSYM_FN (hsa_signal_create)
+  DLSYM_FN (hsa_memory_allocate)
+  DLSYM_FN (hsa_memory_copy)
+  DLSYM_FN (hsa_memory_free)
+  DLSYM_FN (hsa_signal_destroy)
+  DLSYM_FN (hsa_executable_get_symbol)
+  DLSYM_FN (hsa_executable_symbol_get_info)
+  DLSYM_FN (hsa_signal_wait_acquire)
+  DLSYM_FN (hsa_signal_wait_relaxed)
+  DLSYM_FN (hsa_signal_store_relaxed)
+  DLSYM_FN (hsa_queue_destroy)
+  DLSYM_FN (hsa_code_object_deserialize)
+  DLSYM_FN (hsa_queue_load_write_index_relaxed)
+  DLSYM_FN (hsa_queue_store_write_index_relaxed)
+  DLSYM_FN (hsa_shut_down)
+
+  return;
+
+fail:
+  fprintf (stderr, "Failed to find HSA functions in " HSA_RUNTIME_LIB "\n");
+  exit (1);
+}
+
+#undef DLSYM_FN
+
+/* Report a fatal error STR together with the HSA error corresponding to
+   STATUS and terminate execution of the current process.  */
+
+static void
+hsa_fatal (const char *str, hsa_status_t status)
+{
+  const char *hsa_error_msg;
+  hsa_fns.hsa_status_string_fn (status, &hsa_error_msg);
+  fprintf (stderr, "%s: FAILED\nHSA Runtime message: %s\n", str,
+	   hsa_error_msg);
+  exit (1);
+}
+
+/* Helper macros to ensure we check the return values from the HSA Runtime.
+   These just keep the rest of the code a bit cleaner.  */
+
+#define XHSA_CMP(FN, CMP, MSG)		   \
+  do {					   \
+    hsa_status_t status = (FN);		   \
+    if (!(CMP))				   \
+      hsa_fatal ((MSG), status);	   \
+    else if (debug)			   \
+      fprintf (stderr, "%s: OK\n", (MSG)); \
+  } while (0)
+#define XHSA(FN, MSG) XHSA_CMP(FN, status == HSA_STATUS_SUCCESS, MSG)
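+
+/* So XHSA (fn (...), "msg") treats anything other than HSA_STATUS_SUCCESS as
+   fatal, while XHSA_CMP allows a custom success test, e.g. also accepting
+   HSA_STATUS_INFO_BREAK from the iterator callbacks below.  */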
+
+/* Callback of hsa_iterate_agents.
+   Called once for each available device, and returns "break" when a
+   suitable one has been found.  */
+
+static hsa_status_t
+get_gpu_agent (hsa_agent_t agent, void *data __attribute__ ((unused)))
+{
+  hsa_device_type_t device_type;
+  XHSA (hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_DEVICE,
+				       &device_type),
+	"Get agent type");
+
+  /* Select only GPU devices.  */
+  /* TODO: support selecting from multiple GPUs.  */
+  if (HSA_DEVICE_TYPE_GPU == device_type)
+    {
+      device = agent;
+      return HSA_STATUS_INFO_BREAK;
+    }
+
+  /* The device was not suitable.  */
+  return HSA_STATUS_SUCCESS;
+}
+
+/* Callback of hsa_agent_iterate_regions.
+   Called once for each available memory region, and returns "break" when a
+   suitable one has been found.  */
+
+static hsa_status_t
+get_kernarg_region (hsa_region_t region, void *data __attribute__ ((unused)))
+{
+  /* Reject non-global regions.  */
+  hsa_region_segment_t segment;
+  hsa_fns.hsa_region_get_info_fn (region, HSA_REGION_INFO_SEGMENT, &segment);
+  if (HSA_REGION_SEGMENT_GLOBAL != segment)
+    return HSA_STATUS_SUCCESS;
+
+  /* Find a region with the KERNARG flag set.  */
+  hsa_region_global_flag_t flags;
+  hsa_fns.hsa_region_get_info_fn (region, HSA_REGION_INFO_GLOBAL_FLAGS,
+				  &flags);
+  if (flags & HSA_REGION_GLOBAL_FLAG_KERNARG)
+    {
+      kernargs_region = region;
+      return HSA_STATUS_INFO_BREAK;
+    }
+
+  /* The region was not suitable.  */
+  return HSA_STATUS_SUCCESS;
+}
+
+/* Initialize the HSA Runtime library and GPU device.  */
+
+static void
+init_device ()
+{
+  /* Load the shared library and find the API functions.  */
+  init_hsa_runtime_functions ();
+
+  /* Initialize the HSA Runtime.  */
+  XHSA (hsa_fns.hsa_init_fn (),
+	"Initialize run-time");
+
+  /* Select a suitable device.
+     The call-back function, get_gpu_agent, does the selection.  */
+  XHSA_CMP (hsa_fns.hsa_iterate_agents_fn (get_gpu_agent, NULL),
+	    status == HSA_STATUS_SUCCESS || status == HSA_STATUS_INFO_BREAK,
+	    "Find a device");
+
+  /* Initialize the queue used for launching kernels.  */
+  uint32_t queue_size = 0;
+  XHSA (hsa_fns.hsa_agent_get_info_fn (device, HSA_AGENT_INFO_QUEUE_MAX_SIZE,
+				       &queue_size),
+	"Find max queue size");
+  XHSA (hsa_fns.hsa_queue_create_fn (device, queue_size,
+				     HSA_QUEUE_TYPE_SINGLE, NULL,
+				     NULL, UINT32_MAX, UINT32_MAX, &queue),
+	"Set up a device queue");
+
+  /* Select a memory region for the kernel arguments.
+     The call-back function, get_kernarg_region, does the selection.  */
+  XHSA_CMP (hsa_fns.hsa_agent_iterate_regions_fn (device, get_kernarg_region,
+						  NULL),
+	    status == HSA_STATUS_SUCCESS || status == HSA_STATUS_INFO_BREAK,
+	    "Locate kernargs memory");
+}
+
+
+/* Read a whole input file.
+   Code copied from mkoffload. */
+
+static char *
+read_file (const char *filename, size_t *plen)
+{
+  size_t alloc = 16384;
+  size_t base = 0;
+  char *buffer;
+
+  FILE *stream = fopen (filename, "rb");
+  if (!stream)
+    {
+      perror (filename);
+      exit (1);
+    }
+
+  if (!fseek (stream, 0, SEEK_END))
+    {
+      /* Get the file size.  */
+      long s = ftell (stream);
+      if (s >= 0)
+	alloc = s + 100;
+      fseek (stream, 0, SEEK_SET);
+    }
+  buffer = malloc (alloc);
+
+  for (;;)
+    {
+      size_t n = fread (buffer + base, 1, alloc - base - 1, stream);
+
+      if (!n)
+	break;
+      base += n;
+      if (base + 1 == alloc)
+	{
+	  alloc *= 2;
+	  buffer = realloc (buffer, alloc);
+	}
+    }
+  buffer[base] = 0;
+  *plen = base;
+
+  fclose (stream);
+
+  return buffer;
+}
+
+/* Read a HSA Code Object (HSACO) from file, and load it into the device.  */
+
+static void
+load_image (const char *filename)
+{
+  size_t image_size;
+  Elf64_Ehdr *image = (void *) read_file (filename, &image_size);
+
+  /* An "executable" consists of one or more code objects.  */
+  XHSA (hsa_fns.hsa_executable_create_fn (HSA_PROFILE_FULL,
+					  HSA_EXECUTABLE_STATE_UNFROZEN, "",
+					  &executable),
+	"Initialize GCN executable");
+
+  /* Hide relocations from the HSA runtime loader.
+     Keep a copy of the unmodified section headers to use later.  */
+  Elf64_Shdr *image_sections =
+    (Elf64_Shdr *) ((char *) image + image->e_shoff);
+  Elf64_Shdr *sections = malloc (sizeof (Elf64_Shdr) * image->e_shnum);
+  memcpy (sections, image_sections, sizeof (Elf64_Shdr) * image->e_shnum);
+  for (int i = image->e_shnum - 1; i >= 0; i--)
+    {
+      if (image_sections[i].sh_type == SHT_RELA
+	  || image_sections[i].sh_type == SHT_REL)
+	/* Change section type to something harmless.  */
+	image_sections[i].sh_type = SHT_NOTE;
+    }
+
+  /* Add the HSACO to the executable.  */
+  hsa_code_object_t co = { 0 };
+  XHSA (hsa_fns.hsa_code_object_deserialize_fn (image, image_size, NULL, &co),
+	"Deserialize GCN code object");
+  XHSA (hsa_fns.hsa_executable_load_code_object_fn (executable, device, co,
+						    ""),
+	"Load GCN code object");
+
+  /* We're done modifying the executable.  */
+  XHSA (hsa_fns.hsa_executable_freeze_fn (executable, ""),
+	"Freeze GCN executable");
+
+  /* Locate the "main" function, and read the kernel's properties.  */
+  hsa_executable_symbol_t symbol;
+  XHSA (hsa_fns.hsa_executable_get_symbol_fn (executable, NULL, "main",
+					      device, 0, &symbol),
+	"Find 'main' function");
+  XHSA (hsa_fns.hsa_executable_symbol_get_info_fn
+	    (symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel),
+	"Extract kernel object");
+  XHSA (hsa_fns.hsa_executable_symbol_get_info_fn
+	    (symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE,
+	     &kernarg_segment_size),
+	"Extract kernarg segment size");
+  XHSA (hsa_fns.hsa_executable_symbol_get_info_fn
+	    (symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE,
+	     &group_segment_size),
+	"Extract group segment size");
+  XHSA (hsa_fns.hsa_executable_symbol_get_info_fn
+	    (symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE,
+	     &private_segment_size),
+	"Extract private segment size");
+
+  /* Find main function in ELF, and calculate actual load offset.  */
+  Elf64_Addr load_offset;
+  XHSA (hsa_fns.hsa_executable_symbol_get_info_fn
+	    (symbol, HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+	     &load_offset),
+	"Extract 'main' symbol address");
+  for (int i = 0; i < image->e_shnum; i++)
+    if (sections[i].sh_type == SHT_SYMTAB)
+      {
+	Elf64_Shdr *strtab = &sections[sections[i].sh_link];
+	char *strings = (char *) image + strtab->sh_offset;
+
+	for (size_t offset = 0;
+	     offset < sections[i].sh_size;
+	     offset += sections[i].sh_entsize)
+	  {
+	    Elf64_Sym *sym = (Elf64_Sym *) ((char *) image
+					    + sections[i].sh_offset + offset);
+	    if (strcmp ("main", strings + sym->st_name) == 0)
+	      {
+		load_offset -= sym->st_value;
+		goto found_main;
+	      }
+	  }
+      }
+  /* We only get here when main was not found.
+     This should never happen.  */
+  fprintf (stderr, "Error: main function not found.\n");
+  abort ();
+found_main:;
+
+  /* Find dynamic symbol table.  */
+  Elf64_Shdr *dynsym = NULL;
+  for (int i = 0; i < image->e_shnum; i++)
+    if (sections[i].sh_type == SHT_DYNSYM)
+      {
+	dynsym = &sections[i];
+	break;
+      }
+
+  /* Fix up relocations.  */
+  for (int i = 0; i < image->e_shnum; i++)
+    {
+      if (sections[i].sh_type == SHT_RELA)
+	for (size_t offset = 0;
+	     offset < sections[i].sh_size;
+	     offset += sections[i].sh_entsize)
+	  {
+	    Elf64_Rela *reloc = (Elf64_Rela *) ((char *) image
+						+ sections[i].sh_offset
+						+ offset);
+	    Elf64_Sym *sym =
+	      (dynsym
+	       ? (Elf64_Sym *) ((char *) image
+				+ dynsym->sh_offset
+				+ (dynsym->sh_entsize
+				   * ELF64_R_SYM (reloc->r_info))) : NULL);
+
+	    int64_t S = (sym ? sym->st_value : 0);
+	    int64_t P = reloc->r_offset + load_offset;
+	    int64_t A = reloc->r_addend;
+	    int64_t B = load_offset;
+	    int64_t V, size;
+	    switch (ELF64_R_TYPE (reloc->r_info))
+	      {
+	      case R_AMDGPU_ABS32_LO:
+		V = (S + A) & 0xFFFFFFFF;
+		size = 4;
+		break;
+	      case R_AMDGPU_ABS32_HI:
+		V = (S + A) >> 32;
+		size = 4;
+		break;
+	      case R_AMDGPU_ABS64:
+		V = S + A;
+		size = 8;
+		break;
+	      case R_AMDGPU_REL32:
+		V = S + A - P;
+		size = 4;
+		break;
+	      case R_AMDGPU_REL64:
+		/* FIXME
+		   LLD seems to emit REL64 where the assembler has ABS64.
+		   This is clearly wrong because it's not what the compiler
+		   is expecting.  Let's assume, for now, that it's a bug.
+		   In any case, GCN kernels are always self-contained and
+		   therefore relative relocations will have been resolved
+		   already, so this should be a safe workaround.  */
+		V = S + A /* - P */ ;
+		size = 8;
+		break;
+	      case R_AMDGPU_ABS32:
+		V = S + A;
+		size = 4;
+		break;
+	      /* TODO R_AMDGPU_GOTPCREL */
+	      /* TODO R_AMDGPU_GOTPCREL32_LO */
+	      /* TODO R_AMDGPU_GOTPCREL32_HI */
+	      case R_AMDGPU_REL32_LO:
+		V = (S + A - P) & 0xFFFFFFFF;
+		size = 4;
+		break;
+	      case R_AMDGPU_REL32_HI:
+		V = (S + A - P) >> 32;
+		size = 4;
+		break;
+	      case R_AMDGPU_RELATIVE64:
+		V = B + A;
+		size = 8;
+		break;
+	      default:
+		fprintf (stderr, "Error: unsupported relocation type.\n");
+		exit (1);
+	      }
+	    XHSA (hsa_fns.hsa_memory_copy_fn ((void *) P, &V, size),
+		  "Fix up relocation");
+	  }
+    }
+}
+
+/* Allocate some device memory from the kernargs region.
+   The returned address will be 32-bit (with excess zeroed on 64-bit host),
+   and accessible via the same address on both host and target (via
+   __flat_scalar GCN address space).  */
+
+static void *
+device_malloc (size_t size)
+{
+  void *result;
+  XHSA (hsa_fns.hsa_memory_allocate_fn (kernargs_region, size, &result),
+	"Allocate device memory");
+  return result;
+}
+
+/* These are the device pointers that will be transferred to the target.
+   The HSA Runtime points the kernargs register here.
+   They correspond to function signature:
+       int main (int argc, char *argv[], int *return_value)
+   The compiler expects this, for kernel functions, and will
+   automatically assign the exit value to *return_value.  */
+struct kernargs
+{
+  /* Kernargs.  */
+  int32_t argc;
+  int64_t argv;
+  int64_t out_ptr;
+  int64_t heap_ptr;
+
+  /* Output data.  */
+  struct output
+  {
+    int return_value;
+    int next_output;
+    struct printf_data
+    {
+      int written;
+      char msg[128];
+      int type;
+      union
+      {
+	int64_t ivalue;
+	double dvalue;
+	char text[128];
+      };
+    } queue[1000];
+  } output_data;
+
+  struct heap
+  {
+    int64_t size;
+    char data[0];
+  } heap;
+};
+
+/* Print any console output from the kernel.
+   We print all entries from print_index to the next entry without a "written"
+   flag.  Subsequent calls should use the returned print_index value to resume
+   from the same point.  */
+void
+gomp_print_output (struct kernargs *kernargs, int *print_index)
+{
+  static bool warned_p = false;
+
+  int limit = (sizeof (kernargs->output_data.queue)
+	       / sizeof (kernargs->output_data.queue[0]));
+
+  int i;
+  for (i = *print_index; i < limit; i++)
+    {
+      struct printf_data *data = &kernargs->output_data.queue[i];
+
+      if (!data->written)
+	break;
+
+      switch (data->type)
+	{
+	case 0:
+	  printf ("%.128s%ld\n", data->msg, data->ivalue);
+	  break;
+	case 1:
+	  printf ("%.128s%f\n", data->msg, data->dvalue);
+	  break;
+	case 2:
+	  printf ("%.128s%.128s\n", data->msg, data->text);
+	  break;
+	case 3:
+	  printf ("%.128s%.128s", data->msg, data->text);
+	  break;
+	}
+
+      data->written = 0;
+    }
+
+  if (kernargs->output_data.next_output > limit && !warned_p)
+    {
+      printf ("WARNING: GCN print buffer exhausted.\n");
+      warned_p = true;
+    }
+
+  *print_index = i;
+}
+
+/* Execute an already-loaded kernel on the device.  */
+
+static void
+run (void *kernargs)
+{
+  /* A "signal" is used to launch and monitor the kernel.  */
+  hsa_signal_t signal;
+  XHSA (hsa_fns.hsa_signal_create_fn (1, 0, NULL, &signal),
+	"Create signal");
+
+  /* Configure for a single-worker kernel.  */
+  uint64_t index = hsa_fns.hsa_queue_load_write_index_relaxed_fn (queue);
+  const uint32_t queueMask = queue->size - 1;
+  hsa_kernel_dispatch_packet_t *dispatch_packet =
+    &(((hsa_kernel_dispatch_packet_t *) (queue->base_address))[index &
+							       queueMask]);
+  dispatch_packet->setup |= 3 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  dispatch_packet->workgroup_size_x = (uint16_t) 1;
+  dispatch_packet->workgroup_size_y = (uint16_t) 64;
+  dispatch_packet->workgroup_size_z = (uint16_t) 1;
+  dispatch_packet->grid_size_x = 1;
+  dispatch_packet->grid_size_y = 64;
+  dispatch_packet->grid_size_z = 1;
+  dispatch_packet->completion_signal = signal;
+  dispatch_packet->kernel_object = kernel;
+  dispatch_packet->kernarg_address = (void *) kernargs;
+  dispatch_packet->private_segment_size = private_segment_size;
+  dispatch_packet->group_segment_size = group_segment_size;
+
+  uint16_t header = 0;
+  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
+  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
+  header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
+
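+  /* Store the header last, with release semantics, so that the GPU's packet
+     processor cannot observe a partially-initialized dispatch packet.  */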
+  __atomic_store_n ((uint32_t *) dispatch_packet,
+		    header | (dispatch_packet->setup << 16),
+		    __ATOMIC_RELEASE);
+
+  if (debug)
+    fprintf (stderr, "Launch kernel\n");
+
+  hsa_fns.hsa_queue_store_write_index_relaxed_fn (queue, index + 1);
+  hsa_fns.hsa_signal_store_relaxed_fn (queue->doorbell_signal, index);
+  /* Kernel running ......  */
+  int print_index = 0;
+  while (hsa_fns.hsa_signal_wait_relaxed_fn (signal, HSA_SIGNAL_CONDITION_LT,
+					     1, 1000000,
+					     HSA_WAIT_STATE_ACTIVE) != 0)
+    {
+      usleep (10000);
+      gomp_print_output (kernargs, &print_index);
+    }
+
+  gomp_print_output (kernargs, &print_index);
+
+  if (debug)
+    fprintf (stderr, "Kernel exited\n");
+
+  XHSA (hsa_fns.hsa_signal_destroy_fn (signal),
+	"Clean up signal");
+}
+
+int
+main (int argc, char *argv[])
+{
+  int kernel_arg = 0;
+  for (int i = 1; i < argc; i++)
+    {
+      if (!strcmp (argv[i], "--help"))
+	{
+	  usage (argv[0]);
+	  return 0;
+	}
+      else if (!strcmp (argv[i], "--version"))
+	{
+	  version (argv[0]);
+	  return 0;
+	}
+      else if (!strcmp (argv[i], "--debug"))
+	debug = true;
+      else if (argv[i][0] == '-')
+	{
+	  usage (argv[0]);
+	  return 1;
+	}
+      else
+	{
+	  kernel_arg = i;
+	  break;
+	}
+    }
+
+  if (!kernel_arg)
+    {
+      /* No kernel arguments were found.  */
+      usage (argv[0]);
+      return 1;
+    }
+
+  /* The remaining arguments are for the GCN kernel.  */
+  int kernel_argc = argc - kernel_arg;
+  char **kernel_argv = &argv[kernel_arg];
+
+  init_device ();
+  load_image (kernel_argv[0]);
+
+  /* Calculate the size of the argv string data.  */
+  size_t args_size = 0;
+  for (int i = 0; i < kernel_argc; i++)
+    args_size += strlen (kernel_argv[i]) + 1;
+
+  /* Allocate device memory for both function parameters and the argv
+     data.  */
+  size_t heap_size = 10 * 1024 * 1024;	/* 10MB.  */
+  struct kernargs *kernargs = device_malloc (sizeof (*kernargs) + heap_size);
+  struct argdata
+  {
+    int64_t argv_data[kernel_argc];
+    char strings[args_size];
+  } *args = device_malloc (sizeof (struct argdata));
+
+  /* Write the data to the target.  */
+  kernargs->argc = kernel_argc;
+  kernargs->argv = (int64_t) args->argv_data;
+  kernargs->out_ptr = (int64_t) &kernargs->output_data;
+  kernargs->output_data.return_value = 0xcafe0000; /* Default return value. */
+  kernargs->output_data.next_output = 0;
+  for (unsigned i = 0; i < (sizeof (kernargs->output_data.queue)
+			    / sizeof (kernargs->output_data.queue[0])); i++)
+    kernargs->output_data.queue[i].written = 0;
+  int offset = 0;
+  for (int i = 0; i < kernel_argc; i++)
+    {
+      size_t arg_len = strlen (kernel_argv[i]) + 1;
+      args->argv_data[i] = (int64_t) &args->strings[offset];
+      memcpy (&args->strings[offset], kernel_argv[i], arg_len);
+      offset += arg_len;
+    }
+  kernargs->heap_ptr = (int64_t) &kernargs->heap;
+  kernargs->heap.size = heap_size;
+
+  /* Run the kernel on the GPU.  */
+  run (kernargs);
+  unsigned int return_value =
+    (unsigned int) kernargs->output_data.return_value;
+
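+  /* The top 16 bits encode how the exit value was set: 0xcafe means it was
+     never set (the default written above), 0xffff means it was set via
+     exit/abort (with any signal number in bits 8-15), and 0 means a normal
+     return from main.  */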
+  unsigned int upper = (return_value & ~0xffff) >> 16;
+  if (upper == 0xcafe)
+    printf ("Kernel exit value was never set\n");
+  else if (upper == 0xffff)
+    ; /* Set by exit.  */
+  else if (upper == 0)
+    ; /* Set by return from main.  */
+  else
+    printf ("Possible kernel exit value corruption, 2 most significant bytes "
+	    "aren't 0xffff, 0xcafe, or 0: 0x%x\n", return_value);
+
+  if (upper == 0xffff)
+    {
+      unsigned int signal = (return_value >> 8) & 0xff;
+      if (signal == SIGABRT)
+	printf ("Kernel aborted\n");
+      else if (signal != 0)
+	printf ("Kernel received unknown signal\n");
+    }
+
+  if (debug)
+    printf ("Kernel exit value: %d\n", return_value & 0xff);
+
+  /* Clean shut down.  */
+  XHSA (hsa_fns.hsa_memory_free_fn (kernargs),
+	"Clean up device memory");
+  XHSA (hsa_fns.hsa_executable_destroy_fn (executable),
+	"Clean up GCN executable");
+  XHSA (hsa_fns.hsa_queue_destroy_fn (queue),
+	"Clean up device queue");
+  XHSA (hsa_fns.hsa_shut_down_fn (),
+	"Shut down run-time");
+
+  return return_value & 0xff;
+}
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
new file mode 100644
index 0000000..0365baf
--- /dev/null
+++ b/gcc/config/gcn/gcn-tree.c
@@ -0,0 +1,715 @@
+/* Copyright (C) 2017-2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+   
+   GCC is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3, or (at your option) any later
+   version.
+   
+   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+   
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* {{{ Includes.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "gimple-iterator.h"
+#include "cfghooks.h"
+#include "cfgloop.h"
+#include "tm_p.h"
+#include "stringpool.h"
+#include "fold-const.h"
+#include "varasm.h"
+#include "omp-low.h"
+#include "omp-general.h"
+#include "internal-fn.h"
+#include "tree-vrp.h"
+#include "tree-ssanames.h"
+#include "tree-ssa-operands.h"
+#include "gimplify.h"
+#include "tree-phinodes.h"
+#include "cgraph.h"
+#include "targhooks.h"
+#include "langhooks-def.h"
+
+/* }}}  */
+/* {{{ OMP GCN pass.  */
+
+unsigned int
+execute_omp_gcn (void)
+{
+  tree thr_num_tree = builtin_decl_explicit (BUILT_IN_OMP_GET_THREAD_NUM);
+  tree thr_num_id = DECL_NAME (thr_num_tree);
+  tree team_num_tree = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
+  tree team_num_id = DECL_NAME (team_num_tree);
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  unsigned int todo = 0;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *call = gsi_stmt (gsi);
+      tree decl;
+
+      if (is_gimple_call (call) && (decl = gimple_call_fndecl (call)))
+	{
+	  tree decl_id = DECL_NAME (decl);
+	  tree lhs = gimple_get_lhs (call);
+
+	  if (decl_id == thr_num_id)
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file,
+			 "Replace '%s' with __builtin_gcn_dim_pos.\n",
+			 IDENTIFIER_POINTER (decl_id));
+
+	      /* Transform this:
+	         lhs = __builtin_omp_get_thread_num ()
+	         to this:
+	         lhs = __builtin_gcn_dim_pos (1)  */
+	      tree fn = targetm.builtin_decl (GCN_BUILTIN_OMP_DIM_POS, 0);
+	      tree fnarg = build_int_cst (unsigned_type_node, 1);
+	      gimple *stmt = gimple_build_call (fn, 1, fnarg);
+	      gimple_call_set_lhs (stmt, lhs);
+	      gsi_replace (&gsi, stmt, true);
+
+	      todo |= TODO_update_ssa;
+	    }
+	  else if (decl_id == team_num_id)
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file,
+			 "Replace '%s' with __builtin_gcn_dim_pos.\n",
+			 IDENTIFIER_POINTER (decl_id));
+
+	      /* Transform this:
+	         lhs = __builtin_omp_get_team_num ()
+	         to this:
+	         lhs = __builtin_gcn_dim_pos (0)  */
+	      tree fn = targetm.builtin_decl (GCN_BUILTIN_OMP_DIM_POS, 0);
+	      tree fnarg = build_zero_cst (unsigned_type_node);
+	      gimple *stmt = gimple_build_call (fn, 1, fnarg);
+	      gimple_call_set_lhs (stmt, lhs);
+	      gsi_replace (&gsi, stmt, true);
+
+	      todo |= TODO_update_ssa;
+	    }
+	}
+    }
+
+  return todo;
+}
+
+namespace
+{
+
+  const pass_data pass_data_omp_gcn = {
+    GIMPLE_PASS,
+    "omp_gcn",			/* name */
+    OPTGROUP_NONE,		/* optinfo_flags */
+    TV_NONE,			/* tv_id */
+    0,				/* properties_required */
+    0,				/* properties_provided */
+    0,				/* properties_destroyed */
+    0,				/* todo_flags_start */
+    TODO_df_finish,		/* todo_flags_finish */
+  };
+
+  class pass_omp_gcn : public gimple_opt_pass
+  {
+  public:
+    pass_omp_gcn (gcc::context *ctxt)
+      : gimple_opt_pass (pass_data_omp_gcn, ctxt)
+    {
+    }
+
+    /* opt_pass methods: */
+    virtual bool gate (function *)
+    {
+      return flag_openmp;
+    }
+
+    virtual unsigned int execute (function *)
+    {
+      return execute_omp_gcn ();
+    }
+
+  }; /* class pass_omp_gcn.  */
+
+} /* anon namespace.  */
+
+gimple_opt_pass *
+make_pass_omp_gcn (gcc::context *ctxt)
+{
+  return new pass_omp_gcn (ctxt);
+}
+
+/* }}}  */
+/* {{{ OpenACC reductions.  */
+
+/* Global lock variable, needed for 128-bit worker & gang reductions.  */
+
+static GTY(()) tree global_lock_var;
+
+/* Lazily generate the global_lock_var decl and return its address.  */
+
+static tree
+gcn_global_lock_addr ()
+{
+  tree v = global_lock_var;
+
+  if (!v)
+    {
+      tree name = get_identifier ("__reduction_lock");
+      tree type = build_qualified_type (unsigned_type_node,
+					TYPE_QUAL_VOLATILE);
+      v = build_decl (BUILTINS_LOCATION, VAR_DECL, name, type);
+      global_lock_var = v;
+      DECL_ARTIFICIAL (v) = 1;
+      DECL_EXTERNAL (v) = 1;
+      TREE_STATIC (v) = 1;
+      TREE_PUBLIC (v) = 1;
+      TREE_USED (v) = 1;
+      mark_addressable (v);
+      mark_decl_referenced (v);
+    }
+
+  return build_fold_addr_expr (v);
+}
+
+/* Helper function for gcn_reduction_update.
+
+   Insert code to locklessly update *PTR with *PTR OP VAR just before
+   GSI.  We use a lockless scheme for nearly all cases, which looks
+   like:
+     actual = initval (OP);
+     do {
+       guess = actual;
+       write = guess OP myval;
+       actual = cmp&swap (ptr, guess, write)
+     } while (actual bit-different-to guess);
+   return write;
+
+   This relies on a cmp&swap instruction, which is available for 32- and
+   64-bit types.  Larger types must use a locking scheme.  */
+
+static tree
+gcn_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
+		     tree ptr, tree var, tree_code op)
+{
+  unsigned fn = GCN_BUILTIN_CMP_SWAP;
+  tree_code code = NOP_EXPR;
+  tree arg_type = unsigned_type_node;
+  tree var_type = TREE_TYPE (var);
+
+  if (TREE_CODE (var_type) == COMPLEX_TYPE
+      || TREE_CODE (var_type) == REAL_TYPE)
+    code = VIEW_CONVERT_EXPR;
+
+  if (TYPE_SIZE (var_type) == TYPE_SIZE (long_long_unsigned_type_node))
+    {
+      arg_type = long_long_unsigned_type_node;
+      fn = GCN_BUILTIN_CMP_SWAPLL;
+    }
+
+  tree swap_fn = gcn_builtin_decl (fn, true);
+
+  gimple_seq init_seq = NULL;
+  tree init_var = make_ssa_name (arg_type);
+  tree init_expr = omp_reduction_init_op (loc, op, var_type);
+  init_expr = fold_build1 (code, arg_type, init_expr);
+  gimplify_assign (init_var, init_expr, &init_seq);
+  gimple *init_end = gimple_seq_last (init_seq);
+
+  gsi_insert_seq_before (gsi, init_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the init stmts.  */
+  basic_block pre_bb = gsi_bb (*gsi);
+  edge pre_edge = split_block (pre_bb, init_end);
+  basic_block loop_bb = pre_edge->dest;
+  pre_bb = pre_edge->src;
+  /* Reset the iterator.  */
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  tree expect_var = make_ssa_name (arg_type);
+  tree actual_var = make_ssa_name (arg_type);
+  tree write_var = make_ssa_name (arg_type);
+
+  /* Build and insert the reduction calculation.  */
+  gimple_seq red_seq = NULL;
+  tree write_expr = fold_build1 (code, var_type, expect_var);
+  write_expr = fold_build2 (op, var_type, write_expr, var);
+  write_expr = fold_build1 (code, arg_type, write_expr);
+  gimplify_assign (write_var, write_expr, &red_seq);
+
+  gsi_insert_seq_before (gsi, red_seq, GSI_SAME_STMT);
+
+  /* Build & insert the cmp&swap sequence.  */
+  gimple_seq latch_seq = NULL;
+  tree swap_expr = build_call_expr_loc (loc, swap_fn, 3,
+					ptr, expect_var, write_var);
+  gimplify_assign (actual_var, swap_expr, &latch_seq);
+
+  gcond *cond = gimple_build_cond (EQ_EXPR, actual_var, expect_var,
+				   NULL_TREE, NULL_TREE);
+  gimple_seq_add_stmt (&latch_seq, cond);
+
+  gimple *latch_end = gimple_seq_last (latch_seq);
+  gsi_insert_seq_before (gsi, latch_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the latch stmts.  */
+  edge post_edge = split_block (loop_bb, latch_end);
+  basic_block post_bb = post_edge->dest;
+  loop_bb = post_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  post_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
+  /* post_edge->probability = profile_probability::even ();  */
+  edge loop_edge = make_edge (loop_bb, loop_bb, EDGE_FALSE_VALUE);
+  /* loop_edge->probability = profile_probability::even ();  */
+  set_immediate_dominator (CDI_DOMINATORS, loop_bb, pre_bb);
+  set_immediate_dominator (CDI_DOMINATORS, post_bb, loop_bb);
+
+  gphi *phi = create_phi_node (expect_var, loop_bb);
+  add_phi_arg (phi, init_var, pre_edge, loc);
+  add_phi_arg (phi, actual_var, loop_edge, loc);
+
+  loop *loop = alloc_loop ();
+  loop->header = loop_bb;
+  loop->latch = loop_bb;
+  add_loop (loop, loop_bb->loop_father);
+
+  return fold_build1 (code, var_type, write_var);
+}
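+
+/* As an illustration (hypothetical C, for an unsigned add reduction), the
+   sequence built above behaves like:
+
+     unsigned guess, write, actual = 0;   (initval for PLUS is 0)
+     do
+       {
+         guess = actual;
+         write = guess + myval;
+         actual = cmp_swap (ptr, guess, write);
+       }
+     while (actual != guess);
+     return write;  */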
+
+/* Helper function for gcn_reduction_update.
+   
+   Insert code to lockfully update *PTR with *PTR OP VAR just before
+   GSI.  This is necessary for types larger than 64 bits, where there
+   is no cmp&swap instruction to implement a lockless scheme.  We use
+   a lock variable in global memory.
+
+   while (cmp&swap (&lock_var, 0, 1))
+     continue;
+   T accum = *ptr;
+   accum = accum OP var;
+   *ptr = accum;
+   cmp&swap (&lock_var, 1, 0);
+   return accum;
+
+   A lock in global memory is necessary to force execution engine
+   descheduling and avoid resource starvation that can occur if the
+   lock is in shared memory.  */
+
+static tree
+gcn_lockfull_update (location_t loc, gimple_stmt_iterator *gsi,
+		     tree ptr, tree var, tree_code op)
+{
+  tree var_type = TREE_TYPE (var);
+  tree swap_fn = gcn_builtin_decl (GCN_BUILTIN_CMP_SWAP, true);
+  tree uns_unlocked = build_int_cst (unsigned_type_node, 0);
+  tree uns_locked = build_int_cst (unsigned_type_node, 1);
+
+  /* Split the block just before the gsi.  Insert a gimple nop to make
+     this easier.  */
+  gimple *nop = gimple_build_nop ();
+  gsi_insert_before (gsi, nop, GSI_SAME_STMT);
+  basic_block entry_bb = gsi_bb (*gsi);
+  edge entry_edge = split_block (entry_bb, nop);
+  basic_block lock_bb = entry_edge->dest;
+  /* Reset the iterator.  */
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  /* Build and insert the locking sequence.  */
+  gimple_seq lock_seq = NULL;
+  tree lock_var = make_ssa_name (unsigned_type_node);
+  tree lock_expr = gcn_global_lock_addr ();
+  lock_expr = build_call_expr_loc (loc, swap_fn, 3, lock_expr,
+				   uns_unlocked, uns_locked);
+  gimplify_assign (lock_var, lock_expr, &lock_seq);
+  gcond *cond = gimple_build_cond (EQ_EXPR, lock_var, uns_unlocked,
+				   NULL_TREE, NULL_TREE);
+  gimple_seq_add_stmt (&lock_seq, cond);
+  gimple *lock_end = gimple_seq_last (lock_seq);
+  gsi_insert_seq_before (gsi, lock_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the lock sequence.  */
+  edge locked_edge = split_block (lock_bb, lock_end);
+  basic_block update_bb = locked_edge->dest;
+  lock_bb = locked_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  /* Create the lock loop.  */
+  locked_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
+  locked_edge->probability = profile_probability::even ();
+  edge loop_edge = make_edge (lock_bb, lock_bb, EDGE_FALSE_VALUE);
+  loop_edge->probability = profile_probability::even ();
+  set_immediate_dominator (CDI_DOMINATORS, lock_bb, entry_bb);
+  set_immediate_dominator (CDI_DOMINATORS, update_bb, lock_bb);
+
+  /* Create the loop structure.  */
+  loop *lock_loop = alloc_loop ();
+  lock_loop->header = lock_bb;
+  lock_loop->latch = lock_bb;
+  lock_loop->nb_iterations_estimate = 1;
+  lock_loop->any_estimate = true;
+  add_loop (lock_loop, entry_bb->loop_father);
+
+  /* Build and insert the reduction calculation.  */
+  gimple_seq red_seq = NULL;
+  tree acc_in = make_ssa_name (var_type);
+  tree ref_in = build_simple_mem_ref (ptr);
+  TREE_THIS_VOLATILE (ref_in) = 1;
+  gimplify_assign (acc_in, ref_in, &red_seq);
+
+  tree acc_out = make_ssa_name (var_type);
+  tree update_expr = fold_build2 (op, var_type, ref_in, var);
+  gimplify_assign (acc_out, update_expr, &red_seq);
+
+  tree ref_out = build_simple_mem_ref (ptr);
+  TREE_THIS_VOLATILE (ref_out) = 1;
+  gimplify_assign (ref_out, acc_out, &red_seq);
+
+  gsi_insert_seq_before (gsi, red_seq, GSI_SAME_STMT);
+
+  /* Build & insert the unlock sequence.  */
+  gimple_seq unlock_seq = NULL;
+  tree unlock_expr = gcn_global_lock_addr ();
+  unlock_expr = build_call_expr_loc (loc, swap_fn, 3, unlock_expr,
+				     uns_locked, uns_unlocked);
+  gimplify_and_add (unlock_expr, &unlock_seq);
+  gsi_insert_seq_before (gsi, unlock_seq, GSI_SAME_STMT);
+
+  return acc_out;
+}
+
+/* Emit a sequence to update a reduction accumulator at *PTR with the
+   value held in VAR using operator OP.  Return the updated value.
+
+   TODO: optimize for atomic ops and independent complex ops.  */
+
+static tree
+gcn_reduction_update (location_t loc, gimple_stmt_iterator *gsi,
+		      tree ptr, tree var, tree_code op)
+{
+  tree type = TREE_TYPE (var);
+  tree size = TYPE_SIZE (type);
+
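+  /* Types matching the 32-bit or 64-bit integer sizes can use the lockless
+     cmp&swap scheme; anything else must take the global lock.  */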
+  if (size == TYPE_SIZE (unsigned_type_node)
+      || size == TYPE_SIZE (long_long_unsigned_type_node))
+    return gcn_lockless_update (loc, gsi, ptr, var, op);
+  else
+    return gcn_lockfull_update (loc, gsi, ptr, var, op);
+}
+
+/* Return a temporary variable decl to use for an OpenACC worker reduction.  */
+
+static tree
+gcn_goacc_get_worker_red_decl (tree type, unsigned offset)
+{
+  machine_function *machfun = cfun->machine;
+  tree existing_decl;
+
+  if (TREE_CODE (type) == REFERENCE_TYPE)
+    type = TREE_TYPE (type);
+
+  tree var_type
+    = build_qualified_type (type,
+			    (TYPE_QUALS (type)
+			     | ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_LDS)));
+
+  if (machfun->reduc_decls
+      && offset < machfun->reduc_decls->length ()
+      && (existing_decl = (*machfun->reduc_decls)[offset]))
+    {
+      gcc_assert (TREE_TYPE (existing_decl) == var_type);
+      return existing_decl;
+    }
+  else
+    {
+      char name[50];
+      sprintf (name, ".oacc_reduction_%u", offset);
+      tree decl = create_tmp_var_raw (var_type, name);
+
+      DECL_CONTEXT (decl) = NULL_TREE;
+      TREE_STATIC (decl) = 1;
+
+      varpool_node::finalize_decl (decl);
+
+      vec_safe_grow_cleared (machfun->reduc_decls, offset + 1);
+      (*machfun->reduc_decls)[offset] = decl;
+
+      return decl;
+    }
+}
+
+/* Expand IFN_GOACC_REDUCTION_SETUP.  */
+
+static void
+gcn_goacc_reduction_setup (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Copy the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	var = build_simple_mem_ref (ref_to_res);
+    }
+
+  if (level == GOMP_DIM_WORKER)
+    {
+      tree var_type = TREE_TYPE (var);
+      /* Store incoming value to worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree decl
+	= gcn_goacc_get_worker_red_decl (var_type, TREE_INT_CST_LOW (offset));
+
+      gimplify_assign (decl, var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* Expand IFN_GOACC_REDUCTION_INIT.  */
+
+static void
+gcn_goacc_reduction_init (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code rcode
+    = (enum tree_code) TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  tree init = omp_reduction_init_op (gimple_location (call), rcode,
+				     TREE_TYPE (var));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_GANG)
+    {
+      /* If there's no receiver object, propagate the incoming VAR.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+      if (integer_zerop (ref_to_res))
+	init = var;
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, init, &seq);
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* Expand IFN_GOACC_REDUCTION_FINI.  */
+
+static void
+gcn_goacc_reduction_fini (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 1);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code op
+    = (enum tree_code) TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  gimple_seq seq = NULL;
+  tree r = NULL_TREE;
+
+  push_gimplify_context (true);
+
+  tree accum = NULL_TREE;
+
+  if (level == GOMP_DIM_WORKER)
+    {
+      tree var_type = TREE_TYPE (var);
+      tree offset = gimple_call_arg (call, 5);
+      tree decl
+	= gcn_goacc_get_worker_red_decl (var_type, TREE_INT_CST_LOW (offset));
+
+      accum = build_fold_addr_expr (decl);
+    }
+  else if (integer_zerop (ref_to_res))
+    r = var;
+  else
+    accum = ref_to_res;
+
+  if (accum)
+    {
+      /* UPDATE the accumulator.  */
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+      seq = NULL;
+      r = gcn_reduction_update (gimple_location (call), &gsi, accum, var, op);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, r, &seq);
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* Expand IFN_GOACC_REDUCTION_TEARDOWN.  */
+
+static void
+gcn_goacc_reduction_teardown (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_WORKER)
+    {
+      tree var_type = TREE_TYPE (var);
+
+      /* Read the worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree decl
+	= gcn_goacc_get_worker_red_decl (var_type, TREE_INT_CST_LOW (offset));
+      var = decl;
+    }
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Write to the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	gimplify_assign (build_simple_mem_ref (ref_to_res), var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* Implement TARGET_GOACC_REDUCTION.
+
+   Expand calls to the GOACC_REDUCTION internal function into a sequence of
+   gimple instructions.  */
+
+void
+gcn_goacc_reduction (gcall *call)
+{
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      default_goacc_reduction (call);
+      return;
+    }
+
+  unsigned code = (unsigned) TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+  switch (code)
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      gcn_goacc_reduction_setup (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_INIT:
+      gcn_goacc_reduction_init (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_FINI:
+      gcn_goacc_reduction_fini (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      gcn_goacc_reduction_teardown (call);
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Implement TARGET_GOACC_ADJUST_PROPAGATION_RECORD.
+
+   Tweak the (worker) propagation record, e.g. to put it in shared memory.  */
+
+tree
+gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+				     const char *name)
+{
+  tree type = record_type;
+
+  TYPE_ADDR_SPACE (type) = ADDR_SPACE_LDS;
+
+  if (!sender)
+    type = build_pointer_type (type);
+
+  tree decl = create_tmp_var_raw (type, name);
+
+  if (sender)
+    {
+      DECL_CONTEXT (decl) = NULL_TREE;
+      TREE_STATIC (decl) = 1;
+    }
+
+  if (sender)
+    varpool_node::finalize_decl (decl);
+
+  return decl;
+}
+
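+/* Tweak the declaration of a gang-private variable VAR: move it into the
+   LDS address space and make it static, so that each workgroup gets its
+   own copy in shared memory.  */
+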
+void
+gcn_goacc_adjust_gangprivate_decl (tree var)
+{
+  tree type = TREE_TYPE (var);
+  tree lds_type = build_qualified_type (type,
+		    TYPE_QUALS_NO_ADDR_SPACE (type)
+		    | ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_LDS));
+  machine_function *machfun = cfun->machine;
+
+  TREE_TYPE (var) = lds_type;
+  TREE_STATIC (var) = 1;
+
+  /* We're making VAR static.  We have to mangle the name to avoid collisions
+     between different local variables that share the same names.  */
+  lhd_set_decl_assembler_name (var);
+
+  varpool_node::finalize_decl (var);
+
+  if (machfun)
+    machfun->use_flat_addressing = true;
+}
+
+/* }}}  */
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
new file mode 100644
index 0000000..0531c4f
--- /dev/null
+++ b/gcc/config/gcn/gcn-valu.md
@@ -0,0 +1,3509 @@
+;; Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+;; This file is free software; you can redistribute it and/or modify it under
+;; the terms of the GNU General Public License as published by the Free
+;; Software Foundation; either version 3 of the License, or (at your option)
+;; any later version.
+
+;; This file is distributed in the hope that it will be useful, but WITHOUT
+;; ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+;; FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+;; for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+;; {{{ Vector iterators
+
+; Vector modes for one vector register
+(define_mode_iterator VEC_1REG_MODE
+		      [V64QI V64HI V64SI V64HF V64SF])
+(define_mode_iterator VEC_1REG_ALT
+		      [V64QI V64HI V64SI V64HF V64SF])
+
+(define_mode_iterator VEC_1REG_INT_MODE
+		      [V64QI V64HI V64SI])
+(define_mode_iterator VEC_1REG_INT_ALT
+		      [V64QI V64HI V64SI])
+
+(define_mode_iterator SCALAR_1REG_INT_MODE
+		      [QI HI SI])
+
+; Vector modes for two vector registers
+(define_mode_iterator VEC_2REG_MODE
+		      [V64DI V64DF])
+
+; All of the above
+(define_mode_iterator VEC_REG_MODE
+		      [V64QI V64HI V64SI V64HF V64SF    ; Single reg
+		       V64DI V64DF])		        ; Double reg
+
+(define_mode_attr scalar_mode
+  [(V64QI "qi") (V64HI "hi") (V64SI "si")
+   (V64HF "hf") (V64SF "sf") (V64DI "di") (V64DF "df")])
+
+(define_mode_attr SCALAR_MODE
+  [(V64QI "QI") (V64HI "HI") (V64SI "SI")
+   (V64HF "HF") (V64SF "SF") (V64DI "DI") (V64DF "DF")])
+
+;; }}}
+;; {{{ Vector moves
+
+; This is the entry point for all vector register moves.  Memory accesses
+; can come this way also, but will more often use the reload_in/out,
+; gather/scatter, maskload/store, etc. patterns.
+
+(define_expand "mov<mode>"
+  [(set (match_operand:VEC_REG_MODE 0 "nonimmediate_operand")
+	(match_operand:VEC_REG_MODE 1 "general_operand"))]
+  ""
+  {
+    /* Do not attempt to move unspec vectors.  */
+    if (GET_CODE (operands[1]) == UNSPEC
+	&& XINT (operands[1], 1) == UNSPEC_VECTOR)
+      FAIL;
+
+    if (can_create_pseudo_p ())
+      {
+        rtx exec = gcn_full_exec_reg ();
+	rtx undef = gcn_gen_undef (<MODE>mode);
+
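+	/* Memory operands must become gathers and scatters: expand the
+	   scalar base address into a vector of per-lane addresses.  */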
+	if (MEM_P (operands[0]))
+	  {
+	    operands[1] = force_reg (<MODE>mode, operands[1]);
+	    rtx scratch = gen_rtx_SCRATCH (V64DImode);
+	    rtx a = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
+	    rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
+	    rtx expr = gcn_expand_scalar_to_vector_address (<MODE>mode, exec,
+							    operands[0],
+							    scratch);
+	    emit_insn (gen_scatter<mode>_expr (expr, operands[1], a, v, exec));
+	  }
+	else if (MEM_P (operands[1]))
+	  {
+	    rtx scratch = gen_rtx_SCRATCH (V64DImode);
+	    rtx a = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
+	    rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
+	    rtx expr = gcn_expand_scalar_to_vector_address (<MODE>mode, exec,
+							    operands[1],
+							    scratch);
+	    emit_insn (gen_gather<mode>_expr (operands[0], expr, a, v, undef,
+					      exec));
+	  }
+	else
+	  emit_insn (gen_mov<mode>_vector (operands[0], operands[1], exec,
+					   undef));
+
+	DONE;
+      }
+  })
+
+; A vector move that does not reference EXEC explicitly, and is therefore
+; suitable for use during or after LRA.  It uses the "exec" attribute instead.
+
+(define_insn "mov<mode>_full"
+  [(set (match_operand:VEC_1REG_MODE 0 "nonimmediate_operand" "=v,v")
+	(match_operand:VEC_1REG_MODE 1 "general_operand"      "vA,B"))]
+  "lra_in_progress || reload_completed"
+  "v_mov_b32\t%0, %1"
+  [(set_attr "type" "vop1,vop1")
+   (set_attr "length" "4,8")
+   (set_attr "exec" "full")])
+
+(define_insn "mov<mode>_full"
+  [(set (match_operand:VEC_2REG_MODE 0 "nonimmediate_operand"  "=v")
+	(match_operand:VEC_2REG_MODE 1 "general_operand"      "vDB"))]
+  "lra_in_progress || reload_completed"
+  {
+    if (!REG_P (operands[1]) || REGNO (operands[0]) <= REGNO (operands[1]))
+      return "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1";
+    else
+      return "v_mov_b32\t%H0, %H1\;v_mov_b32\t%L0, %L1";
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "16")
+   (set_attr "exec" "full")])
+
+; An SGPR-base load looks like:
+;   <load> v, Sg
+;
+; There's no hardware instruction that corresponds to this, but vector base
+; addresses are placed in an SGPR because it is easier to add to a vector.
+; We also have a temporary vT, and the vector v1 holding the lane numbers.
+;
+; Rewrite as:
+;   vT = v1 << log2(element-size)
+;   vT += Sg
+;   flat_load v, vT
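+;
+; As an illustrative sketch only (assuming 4-byte elements, and ignoring
+; that flat loads really need a 64-bit vector address, formed similarly),
+; the emitted sequence is roughly:
+;   v_lshlrev_b32   vT, 2, v1        ; vT = v1 << 2
+;   v_add_u32       vT, vcc, Sg, vT  ; vT += Sg
+;   flat_load_dword v, vT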
+
+(define_insn "mov<mode>_sgprbase"
+  [(set (match_operand:VEC_1REG_MODE 0 "nonimmediate_operand" "= v, v, v, m")
+	(unspec:VEC_1REG_MODE
+	  [(match_operand:VEC_1REG_MODE 1 "general_operand"   " vA,vB, m, v")]
+	  UNSPEC_SGPRBASE))
+   (clobber (match_operand:V64DI 2 "register_operand"	      "=&v,&v,&v,&v"))]
+  "lra_in_progress || reload_completed"
+  "@
+   v_mov_b32\t%0, %1
+   v_mov_b32\t%0, %1
+   #
+   #"
+  [(set_attr "type" "vop1,vop1,*,*")
+   (set_attr "length" "4,8,12,12")
+   (set_attr "exec" "full")])
+
+(define_insn "mov<mode>_sgprbase"
+  [(set (match_operand:VEC_2REG_MODE 0 "nonimmediate_operand" "= v, v, m")
+	(unspec:VEC_2REG_MODE
+	  [(match_operand:VEC_2REG_MODE 1 "general_operand"   "vDB, m, v")]
+	  UNSPEC_SGPRBASE))
+   (clobber (match_operand:V64DI 2 "register_operand"	      "=&v,&v,&v"))]
+  "lra_in_progress || reload_completed"
+  "@
+   * if (!REG_P (operands[1]) || REGNO (operands[0]) <= REGNO (operands[1])) \
+       return \"v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\"; \
+     else \
+       return \"v_mov_b32\t%H0, %H1\;v_mov_b32\t%L0, %L1\";
+   #
+   #"
+  [(set_attr "type" "vmult,*,*")
+   (set_attr "length" "8,12,12")
+   (set_attr "exec" "full")])
+
+; reload_in was once a standard name, but here it's only referenced by
+; gcn_secondary_reload.  It allows a reload with a scratch register.
+
+(define_expand "reload_in<mode>"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand" "= v")
+	(match_operand:VEC_REG_MODE 1 "memory_operand"   "  m"))
+   (clobber (match_operand:V64DI 2 "register_operand"    "=&v"))]
+  ""
+  {
+    emit_insn (gen_mov<mode>_sgprbase (operands[0], operands[1], operands[2]));
+    DONE;
+  })
+
+; reload_out is similar to reload_in, above.
+
+(define_expand "reload_out<mode>"
+  [(set (match_operand:VEC_REG_MODE 0 "memory_operand"   "= m")
+	(match_operand:VEC_REG_MODE 1 "register_operand" "  v"))
+   (clobber (match_operand:V64DI 2 "register_operand"    "=&v"))]
+  ""
+  {
+    emit_insn (gen_mov<mode>_sgprbase (operands[0], operands[1], operands[2]));
+    DONE;
+  })
+
+; This is the 'normal' kind of vector move created before register allocation.
+
+(define_insn "mov<mode>_vector"
+  [(set (match_operand:VEC_1REG_MODE 0 "nonimmediate_operand"
+							 "=v, v, v, v, v, m")
+        (vec_merge:VEC_1REG_MODE
+	  (match_operand:VEC_1REG_MODE 1 "general_operand"
+							 "vA, B, v,vA, m, v")
+	  (match_operand:VEC_1REG_MODE 3 "gcn_alu_or_unspec_operand"
+							 "U0,U0,vA,vA,U0,U0")
+	  (match_operand:DI 2 "register_operand"	 " e, e,cV,Sg, e, e")))
+   (clobber (match_scratch:V64DI 4			 "=X, X, X, X,&v,&v"))]
+  "!MEM_P (operands[0]) || REG_P (operands[1])"
+  "@
+   v_mov_b32\t%0, %1
+   v_mov_b32\t%0, %1
+   v_cndmask_b32\t%0, %3, %1, vcc
+   v_cndmask_b32\t%0, %3, %1, %2
+   #
+   #"
+  [(set_attr "type" "vop1,vop1,vop2,vop3a,*,*")
+   (set_attr "length" "4,8,4,8,16,16")
+   (set_attr "exec" "*,*,full,full,*,*")])
+
+; This variant does not accept an unspec, but does permit MEM
+; read/modify/write, which is necessary for maskstore.
+
+(define_insn "*mov<mode>_vector_match"
+  [(set (match_operand:VEC_1REG_MODE 0 "nonimmediate_operand" "=v,v, v, m")
+        (vec_merge:VEC_1REG_MODE
+	  (match_operand:VEC_1REG_MODE 1 "general_operand"    "vA,B, m, v")
+	  (match_dup 0)
+	  (match_operand:DI 2 "gcn_exec_reg_operand"	      " e,e, e, e")))
+   (clobber (match_scratch:V64DI 3			      "=X,X,&v,&v"))]
+  "!MEM_P (operands[0]) || REG_P (operands[1])"
+  "@
+  v_mov_b32\t%0, %1
+  v_mov_b32\t%0, %1
+  #
+  #"
+  [(set_attr "type" "vop1,vop1,*,*")
+   (set_attr "length" "4,8,16,16")])
+
+(define_insn "mov<mode>_vector"
+  [(set (match_operand:VEC_2REG_MODE 0 "nonimmediate_operand"
+						       "= v,   v,   v, v, m")
+        (vec_merge:VEC_2REG_MODE
+	  (match_operand:VEC_2REG_MODE 1 "general_operand"
+						       "vDB,  v0,  v0, m, v")
+	  (match_operand:VEC_2REG_MODE 3 "gcn_alu_or_unspec_operand"
+						       " U0,vDA0,vDA0,U0,U0")
+	  (match_operand:DI 2 "register_operand"       "  e,  cV,  Sg, e, e")))
+   (clobber (match_scratch:V64DI 4		       "= X,   X,   X,&v,&v"))]
+  "!MEM_P (operands[0]) || REG_P (operands[1])"
+  {
+    if (!REG_P (operands[1]) || REGNO (operands[0]) <= REGNO (operands[1]))
+      switch (which_alternative)
+	{
+	case 0:
+	  return "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1";
+	case 1:
+	  return "v_cndmask_b32\t%L0, %L3, %L1, vcc\;"
+		 "v_cndmask_b32\t%H0, %H3, %H1, vcc";
+	case 2:
+	  return "v_cndmask_b32\t%L0, %L3, %L1, %2\;"
+		 "v_cndmask_b32\t%H0, %H3, %H1, %2";
+	}
+    else
+      switch (which_alternative)
+        {
+	case 0:
+	  return "v_mov_b32\t%H0, %H1\;v_mov_b32\t%L0, %L1";
+	case 1:
+	  return "v_cndmask_b32\t%H0, %H3, %H1, vcc\;"
+		 "v_cndmask_b32\t%L0, %L3, %L1, vcc";
+	case 2:
+	  return "v_cndmask_b32\t%H0, %H3, %H1, %2\;"
+		 "v_cndmask_b32\t%L0, %L3, %L1, %2";
+	}
+
+    return "#";
+  }
+  [(set_attr "type" "vmult,vmult,vmult,*,*")
+   (set_attr "length" "16,16,16,16,16")
+   (set_attr "exec" "*,full,full,*,*")])
+
+; This variant does not accept an unspec, but does permit MEM
+; read/modify/write, which is necessary for maskstore.
+
+(define_insn "*mov<mode>_vector_match"
+  [(set (match_operand:VEC_2REG_MODE 0 "nonimmediate_operand" "=v, v, m")
+        (vec_merge:VEC_2REG_MODE
+	  (match_operand:VEC_2REG_MODE 1 "general_operand"   "vDB, m, v")
+	  (match_dup 0)
+	  (match_operand:DI 2 "gcn_exec_reg_operand"	      " e, e, e")))
+   (clobber (match_scratch:V64DI 3			      "=X,&v,&v"))]
+  "!MEM_P (operands[0]) || REG_P (operands[1])"
+  "@
+   * if (!REG_P (operands[1]) || REGNO (operands[0]) <= REGNO (operands[1])) \
+       return \"v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\"; \
+     else \
+       return \"v_mov_b32\t%H0, %H1\;v_mov_b32\t%L0, %L1\";
+   #
+   #"
+  [(set_attr "type" "vmult,*,*")
+   (set_attr "length" "16,16,16")])
+
+; Expand scalar addresses into gather/scatter patterns
+
+(define_split
+  [(set (match_operand:VEC_REG_MODE 0 "memory_operand")
+	(unspec:VEC_REG_MODE
+	  [(match_operand:VEC_REG_MODE 1 "general_operand")]
+	  UNSPEC_SGPRBASE))
+   (clobber (match_scratch:V64DI 2))]
+  ""
+  [(set (mem:BLK (scratch))
+	(unspec:BLK [(match_dup 5) (match_dup 1)
+		     (match_dup 6) (match_dup 7) (match_dup 8)]
+		    UNSPEC_SCATTER))]
+  {
+    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
+						       operands[0],
+						       operands[2]);
+    operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
+    operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
+    operands[8] = gen_rtx_CONST_INT (VOIDmode, -1);
+  })
+
+(define_split
+  [(set (match_operand:VEC_REG_MODE 0 "memory_operand")
+        (vec_merge:VEC_REG_MODE
+	  (match_operand:VEC_REG_MODE 1 "general_operand")
+	  (match_operand:VEC_REG_MODE 3 "")
+	  (match_operand:DI 2 "gcn_exec_reg_operand")))
+   (clobber (match_scratch:V64DI 4))]
+  ""
+  [(set (mem:BLK (scratch))
+	(unspec:BLK [(match_dup 5) (match_dup 1)
+		     (match_dup 6) (match_dup 7) (match_dup 2)]
+		    UNSPEC_SCATTER))]
+  {
+    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode,
+						       operands[2],
+						       operands[0],
+						       operands[4]);
+    operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
+    operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
+  })
+
+(define_split
+  [(set (match_operand:VEC_REG_MODE 0 "nonimmediate_operand")
+	(unspec:VEC_REG_MODE
+	  [(match_operand:VEC_REG_MODE 1 "memory_operand")]
+	  UNSPEC_SGPRBASE))
+   (clobber (match_scratch:V64DI 2))]
+  ""
+  [(set (match_dup 0)
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE [(match_dup 5) (match_dup 6) (match_dup 7)
+				(mem:BLK (scratch))]
+			       UNSPEC_GATHER)
+	  (match_dup 8)
+          (match_dup 9)))]
+  {
+    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
+						       operands[1],
+						       operands[2]);
+    operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
+    operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
+    operands[8] = gcn_gen_undef (<MODE>mode);
+    operands[9] = gen_rtx_CONST_INT (VOIDmode, -1);
+  })
+
+(define_split
+  [(set (match_operand:VEC_REG_MODE 0 "nonimmediate_operand")
+        (vec_merge:VEC_REG_MODE
+	  (match_operand:VEC_REG_MODE 1 "memory_operand")
+	  (match_operand:VEC_REG_MODE 3 "")
+	  (match_operand:DI 2 "gcn_exec_reg_operand")))
+   (clobber (match_scratch:V64DI 4))]
+  ""
+  [(set (match_dup 0)
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE [(match_dup 5) (match_dup 6) (match_dup 7)
+				(mem:BLK (scratch))]
+			       UNSPEC_GATHER)
+	  (match_dup 3)
+          (match_dup 2)))]
+  {
+    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode,
+						       operands[2],
+						       operands[1],
+						       operands[4]);
+    operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
+    operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
+  })
+
+; TODO: Add zero/sign extending variants.
+
+;; }}}
+;; {{{ Lane moves
+
+; v_writelane and v_readlane work regardless of exec flags.
+; We allow source to be scratch.
+;
+; FIXME: these should take A immediates.
+
+(define_insn "*vec_set<mode>"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"            "= v")
+	(vec_merge:VEC_1REG_MODE
+	  (vec_duplicate:VEC_1REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"	     " SS"))
+	  (match_operand:VEC_1REG_MODE 3 "gcn_register_or_unspec_operand"
+								     " U0")
+	  (ashift (const_int 1)
+		  (match_operand:SI 2 "gcn_alu_operand"		     "SSB"))))]
+  ""
+  "v_writelane_b32 %0, %1, %2"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")
+   (set_attr "laneselect" "yes")])
+
+; FIXME: 64-bit operations really should be splitters, but I am not sure how
+; to represent vertical subregs.
+(define_insn "*vec_set<mode>"
+  [(set (match_operand:VEC_2REG_MODE 0 "register_operand"	     "= v")
+	(vec_merge:VEC_2REG_MODE
+	  (vec_duplicate:VEC_2REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"	     " SS"))
+	  (match_operand:VEC_2REG_MODE 3 "gcn_register_or_unspec_operand"
+								     " U0")
+	  (ashift (const_int 1)
+		  (match_operand:SI 2 "gcn_alu_operand"		     "SSB"))))]
+  ""
+  "v_writelane_b32 %L0, %L1, %2\;v_writelane_b32 %H0, %H1, %2"
+  [(set_attr "type" "vmult")
+   (set_attr "length" "16")
+   (set_attr "laneselect" "yes")])
+
+(define_expand "vec_set<mode>"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand")
+	(vec_merge:VEC_REG_MODE
+	  (vec_duplicate:VEC_REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"))
+	  (match_dup 0)
+	  (ashift (const_int 1) (match_operand:SI 2 "gcn_alu_operand"))))]
+  "")
+
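+; This variant takes the lane selector as a constant bit-mask; the
+; exact_log2 condition ensures that exactly one lane is selected.
+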
+(define_insn "*vec_set<mode>_1"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"	       "=v")
+	(vec_merge:VEC_1REG_MODE
+	  (vec_duplicate:VEC_1REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"	       "SS"))
+	  (match_operand:VEC_1REG_MODE 3 "gcn_register_or_unspec_operand"
+								       "U0")
+	  (match_operand:SI 2 "const_int_operand"	               " i")))]
+  "((unsigned) exact_log2 (INTVAL (operands[2])) < 64)"
+  {
+    operands[2] = GEN_INT (exact_log2 (INTVAL (operands[2])));
+    return "v_writelane_b32 %0, %1, %2";
+  }
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")
+   (set_attr "laneselect" "yes")])
+
+(define_insn "*vec_set<mode>_1"
+  [(set (match_operand:VEC_2REG_MODE 0 "register_operand"	       "=v")
+	(vec_merge:VEC_2REG_MODE
+	  (vec_duplicate:VEC_2REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"	       "SS"))
+	  (match_operand:VEC_2REG_MODE 3 "gcn_register_or_unspec_operand"
+								       "U0")
+	  (match_operand:SI 2 "const_int_operand"		       " i")))]
+  "((unsigned) exact_log2 (INTVAL (operands[2])) < 64)"
+  {
+    operands[2] = GEN_INT (exact_log2 (INTVAL (operands[2])));
+    return "v_writelane_b32 %L0, %L1, %2\;v_writelane_b32 %H0, %H1, %2";
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "16")
+   (set_attr "laneselect" "yes")])
+
+(define_insn "vec_duplicate<mode>"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"  "=v")
+	(vec_duplicate:VEC_1REG_MODE
+	  (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand" "SgB")))]
+  ""
+  "v_mov_b32\t%0, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "exec" "full")
+   (set_attr "length" "8")])
+
+(define_insn "vec_duplicate<mode>"
+  [(set (match_operand:VEC_2REG_MODE 0 "register_operand"  "=  v")
+	(vec_duplicate:VEC_2REG_MODE
+	  (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand" "SgDB")))]
+  ""
+  "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1"
+  [(set_attr "type" "vop3a")
+   (set_attr "exec" "full")
+   (set_attr "length" "16")])
+
+(define_insn "vec_duplicate<mode>_exec"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"	      "= v")
+	(vec_merge:VEC_1REG_MODE
+	  (vec_duplicate:VEC_1REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand"	      "SSB"))
+	  (match_operand:VEC_1REG_MODE 3 "gcn_register_or_unspec_operand"
+								      " U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		      "  e")))]
+  ""
+  "v_mov_b32\t%0, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "vec_duplicate<mode>_exec"
+  [(set (match_operand:VEC_2REG_MODE 0 "register_operand"	      "= v")
+	(vec_merge:VEC_2REG_MODE
+	  (vec_duplicate:VEC_2REG_MODE
+	    (match_operand:<SCALAR_MODE> 1 "register_operand"	     "SgDB"))
+	  (match_operand:VEC_2REG_MODE 3 "gcn_register_or_unspec_operand"
+								      " U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		      "  e")))]
+  ""
+  "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1"
+  [(set_attr "type" "vmult")
+   (set_attr "length" "16")])
+
+(define_insn "vec_extract<mode><scalar_mode>"
+  [(set (match_operand:<SCALAR_MODE> 0 "register_operand"   "=Sg")
+	(vec_select:<SCALAR_MODE>
+	  (match_operand:VEC_1REG_MODE 1 "register_operand" "  v")
+	  (parallel [(match_operand:SI 2 "gcn_alu_operand"  "SSB")])))]
+  ""
+  "v_readlane_b32 %0, %1, %2"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")
+   (set_attr "laneselect" "yes")])
+
+(define_insn "vec_extract<mode><scalar_mode>"
+  [(set (match_operand:<SCALAR_MODE> 0 "register_operand"   "=Sg")
+	(vec_select:<SCALAR_MODE>
+	  (match_operand:VEC_2REG_MODE 1 "register_operand" "  v")
+	  (parallel [(match_operand:SI 2 "gcn_alu_operand"  "SSB")])))]
+  ""
+  "v_readlane_b32 %L0, %L1, %2\;v_readlane_b32 %H0, %H1, %2"
+  [(set_attr "type" "vmult")
+   (set_attr "length" "16")
+   (set_attr "laneselect" "yes")])
+
+(define_expand "vec_init<mode><scalar_mode>"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand 1)]
+  ""
+  {
+    gcn_expand_vector_init (operands[0], operands[1]);
+    DONE;
+  })
+
+;; }}}
+;; {{{ Scatter / Gather
+
+;; GCN does not have an instruction for loading a vector from contiguous
+;; memory, so *all* loads and stores are eventually converted to scatter
+;; or gather.
+;;
+;; GCC does not permit MEM to hold vectors of addresses, so we must use an
+;; unspec.  The unspec formats are as follows:
+;;
+;;     (unspec:V64??
+;;	 [(<address expression>)
+;;	  (<addr_space_t>)
+;;	  (<use_glc>)
+;;	  (mem:BLK (scratch))]
+;;	 UNSPEC_GATHER)
+;;
+;;     (unspec:BLK
+;;	  [(<address expression>)
+;;	   (<source register>)
+;;	   (<addr_space_t>)
+;;	   (<use_glc>)
+;;	   (<exec>)]
+;;	  UNSPEC_SCATTER)
+;;
+;; - Loads are expected to be wrapped in a vec_merge, so do not need <exec>.
+;; - The mem:BLK does not contain any real information, but indicates that an
+;;   unknown memory read is taking place.  Stores are expected to use a similar
+;;   mem:BLK outside the unspec.
+;; - The address space and glc (volatile) fields are there to replace the
+;;   fields normally found in a MEM.
+;; - Multiple forms of address expression are supported, below.
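+;;
+;; As an illustrative sketch (operands shown symbolically, not as valid
+;; RTL), a gather of V64SI from the flat address space looks roughly like:
+;;
+;;     (vec_merge:V64SI
+;;	 (unspec:V64SI
+;;	   [(<vector of lane addresses>)
+;;	    (const_int <ADDR_SPACE_FLAT>)
+;;	    (const_int 0)		; not volatile, so no glc
+;;	    (mem:BLK (scratch))]
+;;	   UNSPEC_GATHER)
+;;	 (<previous value>)
+;;	 (<exec mask>))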
+
+(define_expand "gather_load<mode>"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand:DI 1 "register_operand")
+   (match_operand 2 "register_operand")
+   (match_operand 3 "immediate_operand")
+   (match_operand:SI 4 "gcn_alu_operand")]
+  ""
+  {
+    rtx exec = gcn_full_exec_reg ();
+
+    /* TODO: more conversions will be needed when more types are vectorized. */
+    if (GET_MODE (operands[2]) == V64DImode)
+      {
+        rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_vec_truncatev64div64si (tmp, operands[2],
+					       gcn_gen_undef (V64SImode),
+					       exec));
+	operands[2] = tmp;
+      }
+
+    emit_insn (gen_gather<mode>_exec (operands[0], operands[1], operands[2],
+				      operands[3], operands[4], exec));
+    DONE;
+  })
+
+(define_expand "gather<mode>_exec"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand:DI 1 "register_operand")
+   (match_operand:V64SI 2 "register_operand")
+   (match_operand 3 "immediate_operand")
+   (match_operand:SI 4 "gcn_alu_operand")
+   (match_operand:DI 5 "gcn_exec_reg_operand")]
+  ""
+  {
+    rtx dest = operands[0];
+    rtx base = operands[1];
+    rtx offsets = operands[2];
+    int unsignedp = INTVAL (operands[3]);
+    rtx scale = operands[4];
+    rtx exec = operands[5];
+
+    rtx tmpsi = gen_reg_rtx (V64SImode);
+    rtx tmpdi = gen_reg_rtx (V64DImode);
+    rtx undefsi = gcn_gen_undef (V64SImode);
+    rtx undefdi = gcn_gen_undef (V64DImode);
+    rtx undefmode = gcn_gen_undef (<MODE>mode);
+
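+    /* Scale the offsets, using a shift when SCALE is a constant power of
+       two and a vector multiply otherwise.  */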
+    if (CONST_INT_P (scale)
+	&& INTVAL (scale) > 0
+	&& exact_log2 (INTVAL (scale)) >= 0)
+      emit_insn (gen_ashlv64si3 (tmpsi, offsets,
+				 GEN_INT (exact_log2 (INTVAL (scale)))));
+    else
+      emit_insn (gen_mulv64si3_vector_dup (tmpsi, offsets, scale, exec,
+					   undefsi));
+
+    if (DEFAULT_ADDR_SPACE == ADDR_SPACE_FLAT)
+      {
+        if (unsignedp)
+	  emit_insn (gen_addv64di3_zext_dup2 (tmpdi, tmpsi, base, exec,
+					      undefdi));
+	else
+	  emit_insn (gen_addv64di3_sext_dup2 (tmpdi, tmpsi, base, exec,
+					      undefdi));
+	emit_insn (gen_gather<mode>_insn_1offset (dest, tmpdi, const0_rtx,
+						  const0_rtx, const0_rtx,
+						  undefmode, exec));
+      }
+    else if (DEFAULT_ADDR_SPACE == ADDR_SPACE_GLOBAL)
+      emit_insn (gen_gather<mode>_insn_2offsets (dest, base, tmpsi, const0_rtx,
+						 const0_rtx, const0_rtx,
+						 undefmode, exec));
+    else
+      gcc_unreachable ();
+    DONE;
+  })
+
+; Allow any address expression
+(define_expand "gather<mode>_expr"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand")
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE
+	    [(match_operand 1 "")
+	     (match_operand 2 "immediate_operand")
+	     (match_operand 3 "immediate_operand")
+	     (mem:BLK (scratch))]
+	    UNSPEC_GATHER)
+	  (match_operand:VEC_REG_MODE 4 "gcn_register_or_unspec_operand")
+          (match_operand:DI 5 "gcn_exec_operand")))]
+    ""
+    {})
+
+(define_insn "gather<mode>_insn_1offset"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand"	   "=v,  v")
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE
+	    [(plus:V64DI (match_operand:V64DI 1 "register_operand" " v,  v")
+			 (vec_duplicate:V64DI
+			   (match_operand 2 "immediate_operand"	   " n,  n")))
+	     (match_operand 3 "immediate_operand"		   " n,  n")
+	     (match_operand 4 "immediate_operand"		   " n,  n")
+	     (mem:BLK (scratch))]
+	    UNSPEC_GATHER)
+	  (match_operand:VEC_REG_MODE 5 "gcn_register_or_unspec_operand"
+								   "U0, U0")
+          (match_operand:DI 6 "gcn_exec_operand"		   " e,*Kf")))]
+  "(AS_FLAT_P (INTVAL (operands[3]))
+    && ((TARGET_GCN3 && INTVAL(operands[2]) == 0)
+	|| ((unsigned HOST_WIDE_INT)INTVAL(operands[2]) < 0x1000)))
+    || (AS_GLOBAL_P (INTVAL (operands[3]))
+	&& (((unsigned HOST_WIDE_INT)INTVAL(operands[2]) + 0x1000) < 0x2000))"
+  {
+    addr_space_t as = INTVAL (operands[3]);
+    const char *glc = INTVAL (operands[4]) ? " glc" : "";
+
+    static char buf[200];
+    if (AS_FLAT_P (as))
+      {
+        if (TARGET_GCN5_PLUS)
+          sprintf (buf, "flat_load%%s0\t%%0, %%1 offset:%%2%s\;s_waitcnt\t0",
+		   glc);
+	else
+          sprintf (buf, "flat_load%%s0\t%%0, %%1%s\;s_waitcnt\t0", glc);
+      }
+    else if (AS_GLOBAL_P (as))
+      sprintf (buf, "global_load%%s0\t%%0, %%1, off offset:%%2%s\;"
+	       "s_waitcnt\tvmcnt(0)", glc);
+    else
+      gcc_unreachable ();
+
+    return buf;
+  }
+  [(set_attr "type" "flat")
+   (set_attr "length" "12")
+   (set_attr "exec" "*,full")])
+
+(define_insn "gather<mode>_insn_1offset_ds"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand"	   "=v,  v")
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE
+	    [(plus:V64SI (match_operand:V64SI 1 "register_operand" " v,  v")
+			 (vec_duplicate:V64SI
+			   (match_operand 2 "immediate_operand"	   " n,  n")))
+	     (match_operand 3 "immediate_operand"		   " n,  n")
+	     (match_operand 4 "immediate_operand"		   " n,  n")
+	     (mem:BLK (scratch))]
+	    UNSPEC_GATHER)
+	  (match_operand:VEC_REG_MODE 5 "gcn_register_or_unspec_operand"
+								   "U0, U0")
+          (match_operand:DI 6 "gcn_exec_operand"		   " e,*Kf")))]
+  "(AS_ANY_DS_P (INTVAL (operands[3]))
+    && ((unsigned HOST_WIDE_INT)INTVAL(operands[2]) < 0x10000))"
+  {
+    addr_space_t as = INTVAL (operands[3]);
+    static char buf[200];
+    sprintf (buf, "ds_read%%b0\t%%0, %%1 offset:%%2%s\;s_waitcnt\tlgkmcnt(0)",
+	     (AS_GDS_P (as) ? " gds" : ""));
+    return buf;
+  }
+  [(set_attr "type" "ds")
+   (set_attr "length" "12")
+   (set_attr "exec" "*,full")])
+
+(define_insn "gather<mode>_insn_2offsets"
+  [(set (match_operand:VEC_REG_MODE 0 "register_operand"	       "=v")
+	(vec_merge:VEC_REG_MODE
+	  (unspec:VEC_REG_MODE
+	    [(plus:V64DI
+	       (plus:V64DI
+		 (vec_duplicate:V64DI
+		   (match_operand:DI 1 "register_operand"	       "SS"))
+		 (sign_extend:V64DI
+		   (match_operand:V64SI 2 "register_operand"	       " v")))
+	       (vec_duplicate:V64DI (match_operand 3 "immediate_operand" 
+								       " n")))
+	     (match_operand 4 "immediate_operand"		       " n")
+	     (match_operand 5 "immediate_operand"		       " n")
+	     (mem:BLK (scratch))]
+	    UNSPEC_GATHER)
+	  (match_operand:VEC_REG_MODE 6 "gcn_register_or_unspec_operand"
+								       "U0")
+          (match_operand:DI 7 "gcn_exec_operand"		       " e")))]
+  "(AS_GLOBAL_P (INTVAL (operands[4]))
+    && (((unsigned HOST_WIDE_INT)INTVAL(operands[3]) + 0x1000) < 0x2000))"
+  {
+    addr_space_t as = INTVAL (operands[4]);
+    const char *glc = INTVAL (operands[5]) ? " glc" : "";
+
+    static char buf[200];
+    if (AS_GLOBAL_P (as))
+      {
+	/* Work around an assembler bug in which a 64-bit register is
+	   expected, but a 32-bit value would be correct.  */
+	int reg = REGNO (operands[2]) - FIRST_VGPR_REG;
+	sprintf (buf, "global_load%%s0\t%%0, v[%d:%d], %%1 offset:%%3%s\;"
+		      "s_waitcnt\tvmcnt(0)", reg, reg + 1, glc);
+      }
+    else
+      gcc_unreachable ();
+
+    return buf;
+  }
+  [(set_attr "type" "flat")
+   (set_attr "length" "12")])
+
+(define_expand "scatter_store<mode>"
+  [(match_operand:DI 0 "register_operand")
+   (match_operand 1 "register_operand")
+   (match_operand 2 "immediate_operand")
+   (match_operand:SI 3 "gcn_alu_operand")
+   (match_operand:VEC_REG_MODE 4 "register_operand")]
+  ""
+  {
+    rtx exec = gcn_full_exec_reg ();
+
+    /* TODO: more conversions will be needed when more types are vectorized. */
+    if (GET_MODE (operands[1]) == V64DImode)
+      {
+        rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_vec_truncatev64div64si (tmp, operands[1],
+					       gcn_gen_undef (V64SImode),
+					       exec));
+	operands[1] = tmp;
+      }
+
+    emit_insn (gen_scatter<mode>_exec (operands[0], operands[1], operands[2],
+				       operands[3], operands[4], exec));
+    DONE;
+  })
+
+(define_expand "scatter<mode>_exec"
+  [(match_operand:DI 0 "register_operand")
+   (match_operand 1 "register_operand")
+   (match_operand 2 "immediate_operand")
+   (match_operand:SI 3 "gcn_alu_operand")
+   (match_operand:VEC_REG_MODE 4 "register_operand")
+   (match_operand:DI 5 "gcn_exec_reg_operand")]
+  ""
+  {
+    rtx base = operands[0];
+    rtx offsets = operands[1];
+    int unsignedp = INTVAL (operands[2]);
+    rtx scale = operands[3];
+    rtx src = operands[4];
+    rtx exec = operands[5];
+
+    rtx tmpsi = gen_reg_rtx (V64SImode);
+    rtx tmpdi = gen_reg_rtx (V64DImode);
+    rtx undefsi = gcn_gen_undef (V64SImode);
+    rtx undefdi = gcn_gen_undef (V64DImode);
+
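+    /* Scale the offsets, as in gather<mode>_exec above.  */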
+    if (CONST_INT_P (scale)
+	&& INTVAL (scale) > 0
+	&& exact_log2 (INTVAL (scale)) >= 0)
+      emit_insn (gen_ashlv64si3 (tmpsi, offsets,
+				 GEN_INT (exact_log2 (INTVAL (scale)))));
+    else
+      emit_insn (gen_mulv64si3_vector_dup (tmpsi, offsets, scale, exec,
+					   undefsi));
+
+    if (DEFAULT_ADDR_SPACE == ADDR_SPACE_FLAT)
+      {
+	if (unsignedp)
+	  emit_insn (gen_addv64di3_zext_dup2 (tmpdi, tmpsi, base, exec,
+					      undefdi));
+	else
+	  emit_insn (gen_addv64di3_sext_dup2 (tmpdi, tmpsi, base, exec,
+					      undefdi));
+	emit_insn (gen_scatter<mode>_insn_1offset (tmpdi, const0_rtx, src,
+						   const0_rtx, const0_rtx,
+						   exec));
+      }
+    else if (DEFAULT_ADDR_SPACE == ADDR_SPACE_GLOBAL)
+      emit_insn (gen_scatter<mode>_insn_2offsets (base, tmpsi, const0_rtx, src,
+						  const0_rtx, const0_rtx,
+						  exec));
+    else
+      gcc_unreachable ();
+    DONE;
+  })
+
+; Allow any address expression
+(define_expand "scatter<mode>_expr"
+  [(set (mem:BLK (scratch))
+	(unspec:BLK
+	  [(match_operand:V64DI 0 "")
+	   (match_operand:VEC_REG_MODE 1 "register_operand")
+	   (match_operand 2 "immediate_operand")
+	   (match_operand 3 "immediate_operand")
+	   (match_operand:DI 4 "gcn_exec_operand")]
+	  UNSPEC_SCATTER))]
+  ""
+  {})
+
+(define_insn "scatter<mode>_insn_1offset"
+  [(set (mem:BLK (scratch))
+	(unspec:BLK
+	  [(plus:V64DI (match_operand:V64DI 0 "register_operand" "v,  v")
+		       (vec_duplicate:V64DI
+			 (match_operand 1 "immediate_operand"	 "n,  n")))
+	   (match_operand:VEC_REG_MODE 2 "register_operand"	 "v,  v")
+	   (match_operand 3 "immediate_operand"			 "n,  n")
+	   (match_operand 4 "immediate_operand"			 "n,  n")
+	   (match_operand:DI 5 "gcn_exec_operand"		 "e,*Kf")]
+	  UNSPEC_SCATTER))]
+  "(AS_FLAT_P (INTVAL (operands[3]))
+    && (INTVAL(operands[1]) == 0
+	|| (TARGET_GCN5_PLUS
+	    && (unsigned HOST_WIDE_INT)INTVAL(operands[1]) < 0x1000)))
+    || (AS_GLOBAL_P (INTVAL (operands[3]))
+	&& (((unsigned HOST_WIDE_INT)INTVAL(operands[1]) + 0x1000) < 0x2000))"
+  {
+    addr_space_t as = INTVAL (operands[3]);
+    const char *glc = INTVAL (operands[4]) ? " glc" : "";
+
+    static char buf[200];
+    if (AS_FLAT_P (as))
+      {
+	if (TARGET_GCN5_PLUS)
+	  sprintf (buf, "flat_store%%s2\t%%0, %%2 offset:%%1%s\;s_waitcnt\t0",
+		   glc);
+	else
+	  sprintf (buf, "flat_store%%s2\t%%0, %%2%s\;s_waitcnt\t0", glc);
+      }
+    else if (AS_GLOBAL_P (as))
+      sprintf (buf, "global_store%%s2\t%%0, %%2, off offset:%%1%s\;"
+	       "s_waitcnt\tvmcnt(0)", glc);
+    else
+      gcc_unreachable ();
+
+    return buf;
+  }
+  [(set_attr "type" "flat")
+   (set_attr "length" "12")
+   (set_attr "exec" "*,full")])
+
+(define_insn "scatter<mode>_insn_1offset_ds"
+  [(set (mem:BLK (scratch))
+	(unspec:BLK
+	  [(plus:V64SI (match_operand:V64SI 0 "register_operand" "v,  v")
+		       (vec_duplicate:V64SI
+			 (match_operand 1 "immediate_operand"	 "n,  n")))
+	   (match_operand:VEC_REG_MODE 2 "register_operand"	 "v,  v")
+	   (match_operand 3 "immediate_operand"			 "n,  n")
+	   (match_operand 4 "immediate_operand"			 "n,  n")
+	   (match_operand:DI 5 "gcn_exec_operand"		 "e,*Kf")]
+	  UNSPEC_SCATTER))]
+  "(AS_ANY_DS_P (INTVAL (operands[3]))
+    && ((unsigned HOST_WIDE_INT)INTVAL(operands[1]) < 0x10000))"
+  {
+    addr_space_t as = INTVAL (operands[3]);
+    static char buf[200];
+    sprintf (buf, "ds_write%%b2\t%%0, %%2 offset:%%1%s\;s_waitcnt\tlgkmcnt(0)",
+	     (AS_GDS_P (as) ? " gds" : ""));
+    return buf;
+  }
+  [(set_attr "type" "ds")
+   (set_attr "length" "12")
+   (set_attr "exec" "*,full")])
+
+(define_insn "scatter<mode>_insn_2offsets"
+  [(set (mem:BLK (scratch))
+	(unspec:BLK
+	  [(plus:V64DI
+	     (plus:V64DI
+	       (vec_duplicate:V64DI
+		 (match_operand:DI 0 "register_operand"		       "SS"))
+	       (sign_extend:V64DI
+		 (match_operand:V64SI 1 "register_operand"	       " v")))
+	     (vec_duplicate:V64DI (match_operand 2 "immediate_operand" " n")))
+	   (match_operand:VEC_REG_MODE 3 "register_operand"	       " v")
+	   (match_operand 4 "immediate_operand"			       " n")
+	   (match_operand 5 "immediate_operand"			       " n")
+	   (match_operand:DI 6 "gcn_exec_operand"		       " e")]
+	  UNSPEC_SCATTER))]
+  "(AS_GLOBAL_P (INTVAL (operands[4]))
+    && (((unsigned HOST_WIDE_INT)INTVAL(operands[2]) + 0x1000) < 0x2000))"
+  {
+    addr_space_t as = INTVAL (operands[4]);
+    const char *glc = INTVAL (operands[5]) ? " glc" : "";
+
+    static char buf[200];
+    if (AS_GLOBAL_P (as))
+      {
+	/* Work around an assembler bug in which a 64-bit register is
+	   expected, but a 32-bit value would be correct.  */
+	int reg = REGNO (operands[1]) - FIRST_VGPR_REG;
+	sprintf (buf, "global_store%%s3\tv[%d:%d], %%3, %%0 offset:%%2%s\;"
+		      "s_waitcnt\tvmcnt(0)", reg, reg + 1, glc);
+      }
+    else
+      gcc_unreachable ();
+
+    return buf;
+  }
+  [(set_attr "type" "flat")
+   (set_attr "length" "12")])
+
+;; }}}
+;; {{{ Permutations
+
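+; The ds_bpermute_b32 instruction implements a backwards permute: lane i of
+; the result receives the source value of the lane selected by lane i's
+; (byte) address operand, i.e. lane addr[i]/4.
+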
+(define_insn "ds_bpermute<mode>"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"    "=v")
+	(unspec:VEC_1REG_MODE
+	  [(match_operand:VEC_1REG_MODE 2 "register_operand" " v")
+	   (match_operand:V64SI 1 "register_operand"	     " v")
+	   (match_operand:DI 3 "gcn_exec_reg_operand"	     " e")]
+	  UNSPEC_BPERMUTE))]
+  ""
+  "ds_bpermute_b32\t%0, %1, %2\;s_waitcnt\tlgkmcnt(0)"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "12")])
+
+(define_insn_and_split "ds_bpermute<mode>"
+  [(set (match_operand:VEC_2REG_MODE 0 "register_operand"    "=&v")
+	(unspec:VEC_2REG_MODE
+	  [(match_operand:VEC_2REG_MODE 2 "register_operand" " v0")
+	   (match_operand:V64SI 1 "register_operand"	     "  v")
+	   (match_operand:DI 3 "gcn_exec_reg_operand"	     "  e")]
+	  UNSPEC_BPERMUTE))]
+  ""
+  "#"
+  "reload_completed"
+  [(set (match_dup 4) (unspec:V64SI [(match_dup 6) (match_dup 1) (match_dup 3)]
+				    UNSPEC_BPERMUTE))
+   (set (match_dup 5) (unspec:V64SI [(match_dup 7) (match_dup 1) (match_dup 3)]
+				    UNSPEC_BPERMUTE))]
+  {
+    operands[4] = gcn_operand_part (<MODE>mode, operands[0], 0);
+    operands[5] = gcn_operand_part (<MODE>mode, operands[0], 1);
+    operands[6] = gcn_operand_part (<MODE>mode, operands[2], 0);
+    operands[7] = gcn_operand_part (<MODE>mode, operands[2], 1);
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "24")])
+
+;; }}}
+;; {{{ ALU special case: add/sub
+
+(define_mode_iterator V64SIDI [V64SI V64DI])
+
+(define_expand "<expander><mode>3"
+  [(parallel [(set (match_operand:V64SIDI 0 "register_operand")
+		   (vec_merge:V64SIDI
+		     (plus_minus:V64SIDI
+		       (match_operand:V64SIDI 1 "register_operand")
+		       (match_operand:V64SIDI 2 "gcn_alu_operand"))
+		     (match_dup 4)
+		     (match_dup 3)))
+	      (clobber (reg:DI VCC_REG))])]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_insn "addv64si3_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		  "=  v")
+	(vec_merge:V64SI
+	  (plus:V64SI
+	    (match_operand:V64SI 1 "register_operand"		  "%  v")
+	    (match_operand:V64SI 2 "gcn_alu_operand"		  "vSSB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "v_add%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "addsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"	   "=  v")
+	  (plus:SI
+	    (match_operand:SI 1 "register_operand" "%  v")
+	    (match_operand:SI 2 "gcn_alu_operand"  "vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"	   "   e"))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "v_add%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "addv64si3_vector_dup"
+  [(set (match_operand:V64SI 0 "register_operand"		  "= v,  v")
+	(vec_merge:V64SI
+	  (plus:V64SI
+	    (vec_duplicate:V64SI
+	      (match_operand:SI 2 "gcn_alu_operand"		  "SSB,SSB"))
+	    (match_operand:V64SI 1 "register_operand"		  "  v,  v"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" " U0, U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e,*Kf")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "v_add%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")
+   (set_attr "exec" "*,full")])
+
+(define_insn "addv64si3_vector_vcc"
+  [(set (match_operand:V64SI 0 "register_operand"	      "=  v,   v")
+	(vec_merge:V64SI
+	  (plus:V64SI
+	    (match_operand:V64SI 1 "register_operand"	      "%  v,   v")
+	    (match_operand:V64SI 2 "gcn_alu_operand"	      "vSSB,vSSB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand"
+							      "  U0,  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"	      "   e,   e")))
+   (set (match_operand:DI 5 "register_operand"		      "= cV,  Sg")
+	(ior:DI (and:DI (ltu:DI (plus:V64SI (match_dup 1) (match_dup 2))
+				(match_dup 1))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 6 "gcn_register_or_unspec_operand" 
+							      "  U5,  U5"))))]
+  ""
+  "v_add%^_u32\t%0, %5, %2, %1"
+  [(set_attr "type" "vop2,vop3b")
+   (set_attr "length" "8")])
+
+; This pattern only changes the VCC bits when the corresponding lane is
+; enabled, so the set must be described as an ior.
+
+(define_insn "addv64si3_vector_vcc_dup"
+  [(set (match_operand:V64SI 0 "register_operand"		 "= v,  v")
+	(vec_merge:V64SI
+	  (plus:V64SI
+	    (vec_duplicate:V64SI (match_operand:SI 2 "gcn_alu_operand"
+								 "SSB,SSB"))
+	    (match_operand:V64SI 1 "register_operand"		 "  v,  v"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" "U0, U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		 "  e,  e")))
+   (set (match_operand:DI 5 "register_operand"			 "=cV, Sg")
+	(ior:DI (and:DI (ltu:DI (plus:V64SI (vec_duplicate:V64SI (match_dup 2))
+					    (match_dup 1))
+				(vec_duplicate:V64SI (match_dup 2)))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 6 "gcn_register_or_unspec_operand"
+								 " 5U, 5U"))))]
+  ""
+  "v_add%^_u32\t%0, %5, %2, %1"
+  [(set_attr "type" "vop2,vop3b")
+   (set_attr "length" "8,8")])
+
+; This pattern does not accept SGPR operands because the VCC read already
+; counts as an SGPR use and the number of SGPR operands is limited to 1.
+
+(define_insn "addcv64si3_vec"
+  [(set (match_operand:V64SI 0 "register_operand" "=v,v")
+        (vec_merge:V64SI
+	  (plus:V64SI
+	    (plus:V64SI
+	      (vec_merge:V64SI
+		(match_operand:V64SI 7 "gcn_vec1_operand"	  "  A, A")
+		(match_operand:V64SI 8 "gcn_vec0_operand"	  "  A, A")
+		(match_operand:DI 5 "register_operand"		  " cV,Sg"))
+	      (match_operand:V64SI 1 "gcn_alu_operand"		  "%vA,vA"))
+	    (match_operand:V64SI 2 "gcn_alu_operand"		  " vB,vB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" " U0,U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e, e")))
+   (set (match_operand:DI 6 "register_operand"			  "=cV,Sg")
+	(ior:DI (and:DI (ior:DI (ltu:DI (plus:V64SI (plus:V64SI
+						      (vec_merge:V64SI
+							(match_dup 7)
+							(match_dup 8)
+							(match_dup 5))
+						      (match_dup 1))
+						    (match_dup 2))
+					(match_dup 2))
+				(ltu:DI (plus:V64SI (vec_merge:V64SI
+						      (match_dup 7)
+						      (match_dup 8)
+						      (match_dup 5))
+						    (match_dup 1))
+					(match_dup 1)))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 9 "gcn_register_or_unspec_operand"
+								  " 6U,6U"))))]
+  ""
+  "v_addc%^_u32\t%0, %6, %1, %2, %5"
+  [(set_attr "type" "vop2,vop3b")
+   (set_attr "length" "4,8")])
+
+(define_insn "addcv64si3_vec_dup"
+  [(set (match_operand:V64SI 0 "register_operand" "=v,v")
+        (vec_merge:V64SI
+	  (plus:V64SI
+	    (plus:V64SI
+	      (vec_merge:V64SI
+		(match_operand:V64SI 7 "gcn_vec1_operand"	  "  A,  A")
+		(match_operand:V64SI 8 "gcn_vec0_operand"	  "  A,  A")
+		(match_operand:DI 5 "register_operand"		  " cV, Sg"))
+	      (match_operand:V64SI 1 "gcn_alu_operand"		  "%vA, vA"))
+	    (vec_duplicate:V64SI
+	      (match_operand:SI 2 "gcn_alu_operand"		  "SSB,SSB")))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" " U0, U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e,  e")))
+   (set (match_operand:DI 6 "register_operand"			  "=cV, Sg")
+	(ior:DI (and:DI (ior:DI (ltu:DI (plus:V64SI (plus:V64SI
+						      (vec_merge:V64SI
+							(match_dup 7)
+							(match_dup 8)
+							(match_dup 5))
+						      (match_dup 1))
+						    (vec_duplicate:V64SI
+						      (match_dup 2)))
+					(vec_duplicate:V64SI
+					  (match_dup 2)))
+				(ltu:DI (plus:V64SI (vec_merge:V64SI
+						      (match_dup 7)
+						      (match_dup 8)
+						      (match_dup 5))
+						    (match_dup 1))
+					(match_dup 1)))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 9 "gcn_register_or_unspec_operand"
+								  " 6U,6U"))))]
+  ""
+  "v_addc%^_u32\t%0, %6, %1, %2, %5"
+  [(set_attr "type" "vop2,vop3b")
+   (set_attr "length" "4,8")])
+
+(define_insn "subv64si3_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		 "=  v,   v")
+	(vec_merge:V64SI
+	  (minus:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand"		 "vSSB,   v")
+	    (match_operand:V64SI 2 "gcn_alu_operand"		 "   v,vSSB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" " U0,  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		 "   e,   e")))
+   (clobber (reg:DI VCC_REG))]
+  "register_operand (operands[1], VOIDmode)
+   || register_operand (operands[2], VOIDmode)"
+  "@
+   v_sub%^_u32\t%0, vcc, %1, %2
+   v_subrev%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8,8")])
+
+(define_insn "subsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"	  "=  v,   v")
+	  (minus:SI
+	    (match_operand:SI 1 "gcn_alu_operand" "vSSB,   v")
+	    (match_operand:SI 2 "gcn_alu_operand" "   v,vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"	  "   e,   e"))
+   (clobber (reg:DI VCC_REG))]
+  "register_operand (operands[1], VOIDmode)
+   || register_operand (operands[2], VOIDmode)"
+  "@
+   v_sub%^_u32\t%0, vcc, %1, %2
+   v_subrev%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8,8")])
+
+(define_insn "subv64si3_vector_vcc"
+  [(set (match_operand:V64SI 0 "register_operand"    "=  v,   v,   v,   v")
+	(vec_merge:V64SI
+	  (minus:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand" "vSSB,vSSB,   v,   v")
+	    (match_operand:V64SI 2 "gcn_alu_operand" "   v,   v,vSSB,vSSB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand"
+						     "  U0,  U0,  U0,  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand" "   e,   e,   e,   e")))
+   (set (match_operand:DI 5 "register_operand"	     "= cV,  Sg,  cV,  Sg")
+	(ior:DI (and:DI (gtu:DI (minus:V64SI (match_dup 1)
+					     (match_dup 2))
+				(match_dup 1))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 6 "gcn_register_or_unspec_operand"
+						     "  5U,  5U,  5U,  5U"))))]
+  "register_operand (operands[1], VOIDmode)
+   || register_operand (operands[2], VOIDmode)"
+  "@
+   v_sub%^_u32\t%0, %5, %1, %2
+   v_sub%^_u32\t%0, %5, %1, %2
+   v_subrev%^_u32\t%0, %5, %2, %1
+   v_subrev%^_u32\t%0, %5, %2, %1"
+  [(set_attr "type" "vop2,vop3b,vop2,vop3b")
+   (set_attr "length" "8")])
+
+; This pattern does not accept SGPR operands because the VCC read already
+; counts as an SGPR use and the number of SGPR operands is limited to 1.
+
+(define_insn "subcv64si3_vec"
+  [(set (match_operand:V64SI 0 "register_operand"	    "= v, v, v, v")
+        (vec_merge:V64SI
+	  (minus:V64SI
+	    (minus:V64SI
+	      (vec_merge:V64SI
+		(match_operand:V64SI 7 "gcn_vec1_operand"   "  A, A, A, A")
+		(match_operand:V64SI 8 "gcn_vec0_operand"   "  A, A, A, A")
+		(match_operand:DI 5 "gcn_alu_operand"	    " cV,Sg,cV,Sg"))
+	      (match_operand:V64SI 1 "gcn_alu_operand"	    " vA,vA,vB,vB"))
+	    (match_operand:V64SI 2 "gcn_alu_operand"	    " vB,vB,vA,vA"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand"
+							    " U0,U0,U0,U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"	    "  e, e, e, e")))
+   (set (match_operand:DI 6 "register_operand"		    "=cV,Sg,cV,Sg")
+	(ior:DI (and:DI (ior:DI (gtu:DI (minus:V64SI (minus:V64SI
+						       (vec_merge:V64SI
+							 (match_dup 7)
+							 (match_dup 8)
+							 (match_dup 5))
+						       (match_dup 1))
+						     (match_dup 2))
+					(match_dup 2))
+				(ltu:DI (minus:V64SI (vec_merge:V64SI
+						       (match_dup 7)
+						       (match_dup 8)
+						       (match_dup 5))
+						     (match_dup 1))
+					(match_dup 1)))
+			(match_dup 3))
+		(and:DI (not:DI (match_dup 3))
+			(match_operand:DI 9 "gcn_register_or_unspec_operand"
+							    " 6U,6U,6U,6U"))))]
+  "register_operand (operands[1], VOIDmode)
+   || register_operand (operands[2], VOIDmode)"
+  "@
+   v_subb%^_u32\t%0, %6, %1, %2, %5
+   v_subb%^_u32\t%0, %6, %1, %2, %5
+   v_subbrev%^_u32\t%0, %6, %2, %1, %5
+   v_subbrev%^_u32\t%0, %6, %2, %1, %5"
+  [(set_attr "type" "vop2,vop3b,vop2,vop3b")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "addv64di3_vector"
+  [(set (match_operand:V64DI 0 "register_operand"		  "=  &v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (match_operand:V64DI 1 "register_operand"		  "%  v0")
+	    (match_operand:V64DI 2 "gcn_alu_operand"		  "vSSB0"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" "   U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "    e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p  (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[1])
+   && gcn_can_split_p (V64DImode, operands[2])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
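+    /* Split into a low-part add that writes the carry to VCC and a
+       high-part add that consumes it.  */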
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_addv64si3_vector_vcc
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 gcn_operand_part (V64DImode, operands[1], 0),
+		 gcn_operand_part (V64DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    emit_insn (gen_addcv64si3_vec
+		(gcn_operand_part (V64DImode, operands[0], 1),
+		 gcn_operand_part (V64DImode, operands[1], 1),
+		 gcn_operand_part (V64DImode, operands[2], 1),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "subv64di3_vector"
+  [(set (match_operand:V64DI 0 "register_operand"	       "=  &v,   &v")
+	(vec_merge:V64DI
+	  (minus:V64DI
+	    (match_operand:V64DI 1 "gcn_alu_operand"	       "vSSB0,   v0")
+	    (match_operand:V64DI 2 "gcn_alu_operand"	       "   v0,vSSB0"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand"
+							       "   U0,   U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"	       "    e,    e")))
+   (clobber (reg:DI VCC_REG))]
+  "register_operand (operands[1], VOIDmode)
+   || register_operand (operands[2], VOIDmode)"
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[1])
+   && gcn_can_split_p (V64DImode, operands[2])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_subv64si3_vector_vcc
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 gcn_operand_part (V64DImode, operands[1], 0),
+		 gcn_operand_part (V64DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    emit_insn (gen_subcv64si3_vec
+		(gcn_operand_part (V64DImode, operands[0], 1),
+		 gcn_operand_part (V64DImode, operands[1], 1),
+		 gcn_operand_part (V64DImode, operands[2], 1),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8,8")])
+
+(define_insn_and_split "addv64di3_vector_dup"
+  [(set (match_operand:V64DI 0 "register_operand"		  "= &v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (match_operand:V64DI 1 "register_operand"		  "  v0")
+	    (vec_duplicate:V64DI
+	      (match_operand:DI 2 "gcn_alu_operand"		  "SSDB")))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[1])
+   && gcn_can_split_p (V64DImode, operands[2])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_addv64si3_vector_vcc_dup
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 gcn_operand_part (V64DImode, operands[1], 0),
+		 gcn_operand_part (DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    emit_insn (gen_addcv64si3_vec_dup
+		(gcn_operand_part (V64DImode, operands[0], 1),
+		 gcn_operand_part (V64DImode, operands[1], 1),
+		 gcn_operand_part (DImode, operands[2], 1),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "addv64di3_zext"
+  [(set (match_operand:V64DI 0 "register_operand"		  "=&v,&v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (zero_extend:V64DI
+	      (match_operand:V64SI 1 "gcn_alu_operand"		  "0vA,0vB"))
+	    (match_operand:V64DI 2 "gcn_alu_operand"		  "0vB,0vA"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" " U0, U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e,  e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[2])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_addv64si3_vector_vcc
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 operands[1],
+		 gcn_operand_part (V64DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    emit_insn (gen_addcv64si3_vec
+		(gcn_operand_part (V64DImode, operands[0], 1),
+		 gcn_operand_part (V64DImode, operands[2], 1),
+		 const0_rtx,
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8,8")])
+
+(define_insn_and_split "addv64di3_zext_dup"
+  [(set (match_operand:V64DI 0 "register_operand"		  "=&v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (zero_extend:V64DI
+	      (vec_duplicate:V64SI
+		(match_operand:SI 1 "gcn_alu_operand"		  "BSS")))
+	    (match_operand:V64DI 2 "gcn_alu_operand"		  "vA0"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[2])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_addv64si3_vector_vcc_dup
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 gcn_operand_part (DImode, operands[1], 0),
+		 gcn_operand_part (V64DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    emit_insn (gen_addcv64si3_vec
+		(gcn_operand_part (V64DImode, operands[0], 1),
+		 gcn_operand_part (V64DImode, operands[2], 1),
+		 const0_rtx, operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "addv64di3_zext_dup2"
+  [(set (match_operand:V64DI 0 "register_operand"		       "= v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (zero_extend:V64DI (match_operand:V64SI 1 "gcn_alu_operand"
+								       " vA"))
+	    (vec_duplicate:V64DI (match_operand:DI 2 "gcn_alu_operand" "BSS")))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand"      " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		       "  e")))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_addv64si3_vector_vcc_dup
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 operands[1],
+		 gcn_operand_part (DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    rtx dsthi = gcn_operand_part (V64DImode, operands[0], 1);
+    emit_insn (gen_vec_duplicatev64si_exec
+		(dsthi, gcn_operand_part (DImode, operands[2], 1),
+		 operands[3], gcn_gen_undef (V64SImode)));
+    emit_insn (gen_addcv64si3_vec
+		(dsthi, dsthi, const0_rtx, operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "addv64di3_sext_dup2"
+  [(set (match_operand:V64DI 0 "register_operand"		       "= v")
+	(vec_merge:V64DI
+	  (plus:V64DI
+	    (sign_extend:V64DI (match_operand:V64SI 1 "gcn_alu_operand"
+								       " vA"))
+	    (vec_duplicate:V64DI (match_operand:DI 2 "gcn_alu_operand" "BSS")))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand"      " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		       "  e")))
+   (clobber (match_scratch:V64SI 5				       "=&v"))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "gcn_can_split_p (V64DImode, operands[0])
+   && gcn_can_split_p (V64DImode, operands[4])"
+  [(const_int 0)]
+  {
+    rtx vcc = gen_rtx_REG (DImode, VCC_REG);
+    emit_insn (gen_ashrv64si3_vector (operands[5], operands[1], GEN_INT (31),
+				      operands[3], gcn_gen_undef (V64SImode)));
+    emit_insn (gen_addv64si3_vector_vcc_dup
+		(gcn_operand_part (V64DImode, operands[0], 0),
+		 operands[1],
+		 gcn_operand_part (DImode, operands[2], 0),
+		 operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 0),
+		 vcc, gcn_gen_undef (DImode)));
+    rtx dsthi = gcn_operand_part (V64DImode, operands[0], 1);
+    emit_insn (gen_vec_duplicatev64si_exec
+		(dsthi, gcn_operand_part (DImode, operands[2], 1),
+		 operands[3], gcn_gen_undef (V64SImode)));
+    emit_insn (gen_addcv64si3_vec
+		(dsthi, dsthi, operands[5], operands[3],
+		 gcn_operand_part (V64DImode, operands[4], 1),
+		 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		 gcn_vec_constant (V64SImode, 0),
+		 gcn_gen_undef (DImode)));
+    DONE;
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")])
+
+(define_insn "addv64di3_scalarsi"
+  [(set (match_operand:V64DI 0 "register_operand"	       "=&v, v")
+	(plus:V64DI (vec_duplicate:V64DI
+		      (zero_extend:DI
+			(match_operand:SI 2 "register_operand" " Sg,Sg")))
+		    (match_operand:V64DI 1 "register_operand"  "  v, 0")))]
+  ""
+  "v_add%^_u32\t%L0, vcc, %2, %L1\;v_addc%^_u32\t%H0, vcc, 0, %H1, vcc"
+  [(set_attr "type" "vmult")
+   (set_attr "length" "8")
+   (set_attr "exec" "full")])
+
+;; }}}
+;; {{{ DS memory ALU: add/sub
+
+(define_mode_iterator DS_ARITH_MODE [V64SI V64SF V64DI])
+(define_mode_iterator DS_ARITH_SCALAR_MODE [SI SF DI])
+
+;; FIXME: the vector patterns probably need RD expanded to a vector of
+;;        addresses.  For now, the only way a vector can get into LDS is
+;;        if the user puts it there manually.
+;;
+;; FIXME: the scalar patterns are probably fine in themselves, but need to be
+;;        checked to see if anything can ever use them.
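+;;
+;; For instance, a read-modify-write such as "x += v" on an int that the
+;; user has placed in LDS could match add<mode>3_ds_scalar below and be
+;; emitted as a single ds_add instruction.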
+
+(define_insn "add<mode>3_ds_vector"
+  [(set (match_operand:DS_ARITH_MODE 0 "gcn_ds_memory_operand"	      "=RD")
+	(vec_merge:DS_ARITH_MODE
+	  (plus:DS_ARITH_MODE
+	    (match_operand:DS_ARITH_MODE 1 "gcn_ds_memory_operand"    "%RD")
+	    (match_operand:DS_ARITH_MODE 2 "register_operand"	      "  v"))
+	  (match_operand:DS_ARITH_MODE 4 "gcn_register_ds_or_unspec_operand"
+								      " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		      "  e")))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_add%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+(define_insn "add<mode>3_ds_scalar"
+  [(set (match_operand:DS_ARITH_SCALAR_MODE 0 "gcn_ds_memory_operand"  "=RD")
+	(plus:DS_ARITH_SCALAR_MODE
+	  (match_operand:DS_ARITH_SCALAR_MODE 1 "gcn_ds_memory_operand"
+								       "%RD")
+	  (match_operand:DS_ARITH_SCALAR_MODE 2 "register_operand"     "  v")))
+   (use (match_operand:DI 3 "gcn_exec_operand"			       "  e"))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_add%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+(define_insn "sub<mode>3_ds_vector"
+  [(set (match_operand:DS_ARITH_MODE 0 "gcn_ds_memory_operand"	      "=RD")
+	(vec_merge:DS_ARITH_MODE
+	  (minus:DS_ARITH_MODE
+	    (match_operand:DS_ARITH_MODE 1 "gcn_ds_memory_operand"    " RD")
+	    (match_operand:DS_ARITH_MODE 2 "register_operand"	      "  v"))
+	  (match_operand:DS_ARITH_MODE 4 "gcn_register_ds_or_unspec_operand"
+								      " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		      "  e")))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_sub%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+(define_insn "sub<mode>3_ds_scalar"
+  [(set (match_operand:DS_ARITH_SCALAR_MODE 0 "gcn_ds_memory_operand"  "=RD")
+	(minus:DS_ARITH_SCALAR_MODE
+	  (match_operand:DS_ARITH_SCALAR_MODE 1 "gcn_ds_memory_operand"
+								       " RD")
+	  (match_operand:DS_ARITH_SCALAR_MODE 2 "register_operand"     "  v")))
+   (use (match_operand:DI 3 "gcn_exec_operand"			       "  e"))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_sub%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+(define_insn "subr<mode>3_ds_vector"
+  [(set (match_operand:DS_ARITH_MODE 0 "gcn_ds_memory_operand"	      "=RD")
+	(vec_merge:DS_ARITH_MODE
+	  (minus:DS_ARITH_MODE
+	    (match_operand:DS_ARITH_MODE 2 "register_operand"	      "  v")
+	    (match_operand:DS_ARITH_MODE 1 "gcn_ds_memory_operand"    " RD"))
+	  (match_operand:DS_ARITH_MODE 4 "gcn_register_ds_or_unspec_operand"
+								      " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		      "  e")))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_rsub%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+(define_insn "subr<mode>3_ds_scalar"
+  [(set (match_operand:DS_ARITH_SCALAR_MODE 0 "gcn_ds_memory_operand"  "=RD")
+	(minus:DS_ARITH_SCALAR_MODE
+	  (match_operand:DS_ARITH_SCALAR_MODE 2 "register_operand"     "  v")
+	  (match_operand:DS_ARITH_SCALAR_MODE 1 "gcn_ds_memory_operand"
+								       " RD")))
+   (use (match_operand:DI 3 "gcn_exec_operand"			       "  e"))]
+  "rtx_equal_p (operands[0], operands[1])"
+  "ds_rsub%u0\t%A0, %2%O0"
+  [(set_attr "type" "ds")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ ALU special case: mult
+
+(define_code_iterator any_extend [sign_extend zero_extend])
+(define_code_attr sgnsuffix [(sign_extend "%i") (zero_extend "%u")])
+(define_code_attr su [(sign_extend "s") (zero_extend "u")])
+(define_code_attr u [(sign_extend "") (zero_extend "u")])
+(define_code_attr iu [(sign_extend "i") (zero_extend "u")])
+(define_code_attr e [(sign_extend "e") (zero_extend "")])
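+
+;; For example, with sign_extend these iterators instantiate
+;; smulsi3_highpart (emitting v_mul_hi_i32) and with zero_extend
+;; umulsi3_highpart (emitting v_mul_hi_u32); the %i/%u print codes supply
+;; the type suffix (here _i32 or _u32) for the given operand's mode.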
+
+(define_expand "<su>mulsi3_highpart"
+  [(parallel [(set (match_operand:SI 0 "register_operand")
+		   (truncate:SI
+		     (lshiftrt:DI
+		       (mult:DI
+			 (any_extend:DI
+			   (match_operand:SI 1 "register_operand"))
+			 (any_extend:DI
+			   (match_operand:SI 2 "gcn_vop3_operand")))
+		       (const_int 32))))
+	      (use (match_dup 3))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec_reg ();
+
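+    /* The insn patterns below require a register for operand 2, so route
+       constant multipliers to the dedicated pattern that accepts an
+       inline constant ("A").  */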
+    if (CONST_INT_P (operands[2]))
+      {
+	emit_insn (gen_const_<su>mulsi3_highpart_scalar (operands[0],
+							 operands[1],
+							 operands[2],
+							 operands[3]));
+	DONE;
+      }
+  })
+
+(define_insn "<su>mulv64si3_highpart_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		     "=  v")
+	(vec_merge:V64SI
+	  (truncate:V64SI
+	    (lshiftrt:V64DI
+	      (mult:V64DI
+		(any_extend:V64DI
+		  (match_operand:V64SI 1 "gcn_alu_operand"	     "  %v"))
+		(any_extend:V64DI
+		  (match_operand:V64SI 2 "gcn_alu_operand"	     "vSSB")))
+	      (const_int 32)))
+	  (match_operand:V64SI 4 "gcn_register_ds_or_unspec_operand" "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		     "   e")))]
+  ""
+  "v_mul_hi<sgnsuffix>0\t%0, %2, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "<su>mulsi3_highpart_scalar"
+  [(set (match_operand:SI 0 "register_operand"	       "= v")
+	(truncate:SI
+	  (lshiftrt:DI
+	    (mult:DI
+	      (any_extend:DI
+		(match_operand:SI 1 "register_operand" "% v"))
+	      (any_extend:DI
+		(match_operand:SI 2 "register_operand" "vSS")))
+	    (const_int 32))))
+    (use (match_operand:DI 3 "gcn_exec_reg_operand"    "  e"))]
+  ""
+  "v_mul_hi<sgnsuffix>0\t%0, %2, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "const_<su>mulsi3_highpart_scalar"
+  [(set (match_operand:SI 0 "register_operand"	       "=v")
+	(truncate:SI
+	  (lshiftrt:DI
+	    (mult:DI
+	      (any_extend:DI
+		(match_operand:SI 1 "register_operand" "%v"))
+	      (match_operand:SI 2 "gcn_vop3_operand"   " A"))
+	    (const_int 32))))
+    (use (match_operand:DI 3 "gcn_exec_reg_operand"    " e"))]
+  ""
+  "v_mul_hi<sgnsuffix>0\t%0, %1, %2"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_expand "<u>mulhisi3"
+  [(parallel [(set (match_operand:SI 0 "register_operand")
+		   (mult:SI
+		     (any_extend:SI (match_operand:HI 1 "register_operand"))
+		     (any_extend:SI (match_operand:HI 2 "register_operand"))))
+	      (use (match_dup 3))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec_reg ();
+  })
+
+(define_insn "<u>mulhisi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"			"=v")
+	(mult:SI
+	  (any_extend:SI (match_operand:HI 1 "register_operand" "%v"))
+	  (any_extend:SI (match_operand:HI 2 "register_operand" " v"))))
+   (use (match_operand:DI 3 "gcn_exec_reg_operand"	        " e"))]
+  ""
+  "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:WORD_0 src1_sel:WORD_0"
+  [(set_attr "type" "vop_sdwa")
+   (set_attr "length" "8")])
+
+(define_expand "<u>mulqihi3"
+  [(parallel [(set (match_operand:HI 0 "register_operand")
+		   (mult:HI
+		     (any_extend:HI (match_operand:QI 1 "register_operand"))
+		     (any_extend:HI (match_operand:QI 2 "register_operand"))))
+	      (use (match_dup 3))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec_reg ();
+  })
+
+(define_insn "<u>mulqihi3_scalar"
+  [(set (match_operand:HI 0 "register_operand"			"=v")
+	(mult:HI
+	  (any_extend:HI (match_operand:QI 1 "register_operand" "%v"))
+	  (any_extend:HI (match_operand:QI 2 "register_operand" " v"))))
+   (use (match_operand:DI 3 "gcn_exec_reg_operand"		" e"))]
+  ""
+  "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:BYTE_0 src1_sel:BYTE_0"
+  [(set_attr "type" "vop_sdwa")
+   (set_attr "length" "8")])
+
+(define_expand "mulv64si3"
+  [(set (match_operand:V64SI 0 "register_operand")
+	(vec_merge:V64SI
+	  (mult:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand")
+	    (match_operand:V64SI 2 "gcn_alu_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (V64SImode);
+  })
+
+(define_insn "mulv64si3_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		  "=   v")
+	(vec_merge:V64SI
+	  (mult:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand"		  "%vSvA")
+	    (match_operand:V64SI 2 "gcn_alu_operand"		  " vSvA"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" "   U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "    e")))]
+  ""
+  "v_mul_lo_u32\t%0, %1, %2"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "mulv64si3_vector_dup"
+  [(set (match_operand:V64SI 0 "register_operand"		  "=   v")
+	(vec_merge:V64SI
+	  (mult:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand"		  "%vSvA")
+	    (vec_duplicate:V64SI
+	      (match_operand:SI 2 "gcn_alu_operand"		  "  SvA")))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" "   U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "    e")))]
+  ""
+  "v_mul_lo_u32\t%0, %1, %2"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_expand "mulv64di3"
+  [(match_operand:V64DI 0 "register_operand")
+   (match_operand:V64DI 1 "gcn_alu_operand")
+   (match_operand:V64DI 2 "gcn_alu_operand")]
+  ""
+  {
+    emit_insn (gen_mulv64di3_vector (operands[0], operands[1], operands[2],
+				     gcn_full_exec_reg (),
+				     gcn_gen_undef (V64DImode)));
+    DONE;
+  })
+
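+; Double-word multiply from single-word parts: bits above 63 of the full
+; product cannot affect the truncated result, so
+;   lo(a*b) = lo (lo(a) * lo(b))
+;   hi(a*b) = umulhi (lo(a), lo(b)) + lo (hi(a) * lo(b)) + lo (lo(a) * hi(b))
+; and the hi(a) * hi(b) term, shifted entirely past bit 63, is dropped.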
+(define_insn_and_split "mulv64di3_vector"
+  [(set (match_operand:V64DI 0 "register_operand"		  "=&v")
+	(vec_merge:V64DI
+	  (mult:V64DI
+	    (match_operand:V64DI 1 "gcn_alu_operand"		  "% v")
+	    (match_operand:V64DI 2 "gcn_alu_operand"		  "vDA"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e")))
+   (clobber (match_scratch:V64SI 5                                "=&v"))]
+  ""
+  "#"
+  "reload_completed"
+  [(const_int 0)]
+  {
+    rtx out_lo = gcn_operand_part (V64DImode, operands[0], 0);
+    rtx out_hi = gcn_operand_part (V64DImode, operands[0], 1);
+    rtx left_lo = gcn_operand_part (V64DImode, operands[1], 0);
+    rtx left_hi = gcn_operand_part (V64DImode, operands[1], 1);
+    rtx right_lo = gcn_operand_part (V64DImode, operands[2], 0);
+    rtx right_hi = gcn_operand_part (V64DImode, operands[2], 1);
+    rtx exec = operands[3];
+    rtx tmp = operands[5];
+
+    rtx old_lo, old_hi;
+    if (GET_CODE (operands[4]) == UNSPEC)
+      {
+	old_lo = old_hi = gcn_gen_undef (V64SImode);
+      }
+    else
+      {
+        old_lo = gcn_operand_part (V64DImode, operands[4], 0);
+        old_hi = gcn_operand_part (V64DImode, operands[4], 1);
+      }
+
+    rtx undef = gcn_gen_undef (V64SImode);
+
+    emit_insn (gen_mulv64si3_vector (out_lo, left_lo, right_lo, exec, old_lo));
+    emit_insn (gen_umulv64si3_highpart_vector (out_hi, left_lo, right_lo,
+					       exec, old_hi));
+    emit_insn (gen_mulv64si3_vector (tmp, left_hi, right_lo, exec, undef));
+    emit_insn (gen_addv64si3_vector (out_hi, out_hi, tmp, exec, out_hi));
+    emit_insn (gen_mulv64si3_vector (tmp, left_lo, right_hi, exec, undef));
+    emit_insn (gen_addv64si3_vector (out_hi, out_hi, tmp, exec, out_hi));
+    DONE;
+  })
+
+(define_insn_and_split "mulv64di3_vector_zext"
+  [(set (match_operand:V64DI 0 "register_operand"		  "=&v")
+	(vec_merge:V64DI
+	  (mult:V64DI
+	    (zero_extend:V64DI
+	      (match_operand:V64SI 1 "gcn_alu_operand"		  "  v"))
+	    (match_operand:V64DI 2 "gcn_alu_operand"		  "vDA"))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e")))
+   (clobber (match_scratch:V64SI 5                                "=&v"))]
+  ""
+  "#"
+  "reload_completed"
+  [(const_int 0)]
+  {
+    rtx out_lo = gcn_operand_part (V64DImode, operands[0], 0);
+    rtx out_hi = gcn_operand_part (V64DImode, operands[0], 1);
+    rtx left = operands[1];
+    rtx right_lo = gcn_operand_part (V64DImode, operands[2], 0);
+    rtx right_hi = gcn_operand_part (V64DImode, operands[2], 1);
+    rtx exec = operands[3];
+    rtx tmp = operands[5];
+
+    rtx old_lo, old_hi;
+    if (GET_CODE (operands[4]) == UNSPEC)
+      {
+	old_lo = old_hi = gcn_gen_undef (V64SImode);
+      }
+    else
+      {
+        old_lo = gcn_operand_part (V64DImode, operands[4], 0);
+        old_hi = gcn_operand_part (V64DImode, operands[4], 1);
+      }
+
+    rtx undef = gcn_gen_undef (V64SImode);
+
+    emit_insn (gen_mulv64si3_vector (out_lo, left, right_lo, exec, old_lo));
+    emit_insn (gen_umulv64si3_highpart_vector (out_hi, left, right_lo,
+					       exec, old_hi));
+    emit_insn (gen_mulv64si3_vector (tmp, left, right_hi, exec, undef));
+    emit_insn (gen_addv64si3_vector (out_hi, out_hi, tmp, exec, out_hi));
+    DONE;
+  })
+
+(define_insn_and_split "mulv64di3_vector_zext_dup2"
+  [(set (match_operand:V64DI 0 "register_operand"		  "= &v")
+	(vec_merge:V64DI
+	  (mult:V64DI
+	    (zero_extend:V64DI
+	      (match_operand:V64SI 1 "gcn_alu_operand"		  "   v"))
+	    (vec_duplicate:V64DI
+	      (match_operand:DI 2 "gcn_alu_operand"		  "SSDA")))
+	  (match_operand:V64DI 4 "gcn_register_or_unspec_operand" "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e")))
+   (clobber (match_scratch:V64SI 5                                "= &v"))]
+  ""
+  "#"
+  "reload_completed"
+  [(const_int 0)]
+  {
+    rtx out_lo = gcn_operand_part (V64DImode, operands[0], 0);
+    rtx out_hi = gcn_operand_part (V64DImode, operands[0], 1);
+    rtx left = operands[1];
+    rtx right_lo = gcn_operand_part (V64DImode, operands[2], 0);
+    rtx right_hi = gcn_operand_part (V64DImode, operands[2], 1);
+    rtx exec = operands[3];
+    rtx tmp = operands[5];
+
+    rtx old_lo, old_hi;
+    if (GET_CODE (operands[4]) == UNSPEC)
+      {
+	old_lo = old_hi = gcn_gen_undef (V64SImode);
+      }
+    else
+      {
+        old_lo = gcn_operand_part (V64DImode, operands[4], 0);
+        old_hi = gcn_operand_part (V64DImode, operands[4], 1);
+      }
+
+    rtx undef = gcn_gen_undef (V64SImode);
+
+    emit_insn (gen_mulv64si3_vector (out_lo, left, right_lo, exec, old_lo));
+    emit_insn (gen_umulv64si3_highpart_vector (out_hi, left, right_lo,
+					       exec, old_hi));
+    emit_insn (gen_mulv64si3_vector (tmp, left, right_hi, exec, undef));
+    emit_insn (gen_addv64si3_vector (out_hi, out_hi, tmp, exec, out_hi));
+    DONE;
+  })
+
+;; }}}
+;; {{{ ALU generic case
+
+(define_mode_iterator VEC_INT_MODE [V64QI V64HI V64SI V64DI])
+
+(define_code_iterator bitop [and ior xor])
+(define_code_iterator bitunop [not popcount])
+(define_code_iterator shiftop [ashift lshiftrt ashiftrt])
+(define_code_iterator minmaxop [smin smax umin umax])
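+
+;; For example, instantiating the first expander below with the bitop
+;; iterator yields the standard optab patterns andv64si3, iorv64si3 and
+;; xorv64si3 (assuming <expander> maps each RTL code to its optab name,
+;; as with the other code attributes in this file).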
+
+(define_expand "<expander><mode>3"
+  [(set (match_operand:VEC_INT_MODE 0 "gcn_valu_dst_operand")
+	(vec_merge:VEC_INT_MODE
+	  (bitop:VEC_INT_MODE
+	    (match_operand:VEC_INT_MODE 1 "gcn_valu_src0_operand")
+	    (match_operand:VEC_INT_MODE 2 "gcn_valu_src1com_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_expand "<expander>v64si3"
+  [(set (match_operand:V64SI 0 "register_operand")
+	(vec_merge:V64SI
+	  (shiftop:V64SI
+	    (match_operand:V64SI 1 "register_operand")
+	    (match_operand:SI 2 "gcn_alu_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (V64SImode);
+  })
+
+(define_expand "v<expander>v64si3"
+  [(set (match_operand:V64SI 0 "register_operand")
+	(vec_merge:V64SI
+	  (shiftop:V64SI
+	    (match_operand:V64SI 1 "register_operand")
+	    (match_operand:V64SI 2 "gcn_alu_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (V64SImode);
+  })
+
+(define_expand "<expander><mode>3"
+  [(set (match_operand:VEC_1REG_INT_MODE 0 "gcn_valu_dst_operand")
+	(vec_merge:VEC_1REG_INT_MODE
+	  (minmaxop:VEC_1REG_INT_MODE
+	    (match_operand:VEC_1REG_INT_MODE 1 "gcn_valu_src0_operand")
+	    (match_operand:VEC_1REG_INT_MODE 2 "gcn_valu_src1_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  "<MODE>mode != V64QImode"
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_insn "<expander><mode>2_vector"
+  [(set (match_operand:VEC_1REG_INT_MODE 0 "gcn_valu_dst_operand"    "=  v")
+	(vec_merge:VEC_1REG_INT_MODE
+	  (bitunop:VEC_1REG_INT_MODE
+	    (match_operand:VEC_1REG_INT_MODE 1 "gcn_valu_src0_operand"
+								     "vSSB"))
+	  (match_operand:VEC_1REG_INT_MODE 3 "gcn_register_or_unspec_operand"
+								     "  U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		     "   e")))]
+  ""
+  "v_<mnemonic>0\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_insn "<expander><mode>3_vector"
+  [(set (match_operand:VEC_1REG_INT_MODE 0 "gcn_valu_dst_operand" "=  v,RD")
+	(vec_merge:VEC_1REG_INT_MODE
+	  (bitop:VEC_1REG_INT_MODE
+	    (match_operand:VEC_1REG_INT_MODE 1 "gcn_valu_src0_operand"
+								  "%  v, 0")
+	    (match_operand:VEC_1REG_INT_MODE 2 "gcn_valu_src1com_operand"
+								  "vSSB, v"))
+	  (match_operand:VEC_1REG_INT_MODE 4
+	    "gcn_register_ds_or_unspec_operand"			  "  U0,U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e, e")))]
+  "!memory_operand (operands[0], VOIDmode)
+   || (rtx_equal_p (operands[0], operands[1])
+       && register_operand (operands[2], VOIDmode))"
+  "@
+   v_<mnemonic>0\t%0, %2, %1
+   ds_<mnemonic>0\t%A0, %2%O0"
+  [(set_attr "type" "vop2,ds")
+   (set_attr "length" "8,8")])
+
+(define_insn "<expander><mode>2_vscalar"
+  [(set (match_operand:SCALAR_1REG_INT_MODE 0 "gcn_valu_dst_operand"  "=  v")
+	(bitunop:SCALAR_1REG_INT_MODE
+	  (match_operand:SCALAR_1REG_INT_MODE 1 "gcn_valu_src0_operand"
+								      "vSSB")))
+   (use (match_operand:DI 2 "gcn_exec_operand"			      "   e"))]
+  ""
+  "v_<mnemonic>0\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_insn "<expander><mode>3_scalar"
+  [(set (match_operand:SCALAR_1REG_INT_MODE 0 "gcn_valu_dst_operand"
+								   "=  v,RD")
+	(vec_and_scalar_com:SCALAR_1REG_INT_MODE
+	  (match_operand:SCALAR_1REG_INT_MODE 1 "gcn_valu_src0_operand"
+								   "%  v, 0")
+	  (match_operand:SCALAR_1REG_INT_MODE 2 "gcn_valu_src1com_operand"
+								   "vSSB, v")))
+   (use (match_operand:DI 3 "gcn_exec_operand"                     "   e, e"))]
+  "!memory_operand (operands[0], VOIDmode)
+   || (rtx_equal_p (operands[0], operands[1])
+       && register_operand (operands[2], VOIDmode))"
+  "@
+   v_<mnemonic>0\t%0, %2, %1
+   ds_<mnemonic>0\t%A0, %2%O0"
+  [(set_attr "type" "vop2,ds")
+   (set_attr "length" "8,8")])
+
+(define_insn_and_split "<expander>v64di3_vector"
+  [(set (match_operand:V64DI 0 "gcn_valu_dst_operand" "=&v,RD")
+	(vec_merge:V64DI
+	  (bitop:V64DI
+	    (match_operand:V64DI 1 "gcn_valu_src0_operand"	  "%  v,RD")
+	    (match_operand:V64DI 2 "gcn_valu_src1com_operand"	  "vSSB, v"))
+	  (match_operand:V64DI 4 "gcn_register_ds_or_unspec_operand"
+								  "  U0,U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e, e")))]
+  "!memory_operand (operands[0], VOIDmode)
+   || (rtx_equal_p (operands[0], operands[1])
+       && register_operand (operands[2], VOIDmode))"
+  "@
+   #
+   ds_<mnemonic>0\t%A0, %2%O0"
+  "(reload_completed && !gcn_ds_memory_operand (operands[0], V64DImode))"
+  [(set (match_dup 5)
+	(vec_merge:V64SI
+	  (bitop:V64SI (match_dup 7) (match_dup 9))
+	  (match_dup 11)
+	  (match_dup 3)))
+   (set (match_dup 6)
+	(vec_merge:V64SI
+	  (bitop:V64SI (match_dup 8) (match_dup 10))
+	  (match_dup 12)
+	  (match_dup 3)))]
+  {
+    operands[5] = gcn_operand_part (V64DImode, operands[0], 0);
+    operands[6] = gcn_operand_part (V64DImode, operands[0], 1);
+    operands[7] = gcn_operand_part (V64DImode, operands[1], 0);
+    operands[8] = gcn_operand_part (V64DImode, operands[1], 1);
+    operands[9] = gcn_operand_part (V64DImode, operands[2], 0);
+    operands[10] = gcn_operand_part (V64DImode, operands[2], 1);
+    operands[11] = gcn_operand_part (V64DImode, operands[4], 0);
+    operands[12] = gcn_operand_part (V64DImode, operands[4], 1);
+  }
+  [(set_attr "type" "vmult,ds")
+   (set_attr "length" "16,8")])
+
+(define_insn_and_split "<expander>di3_scalar"
+  [(set (match_operand:DI 0 "gcn_valu_dst_operand"	   "= &v,RD")
+	  (bitop:DI
+	    (match_operand:DI 1 "gcn_valu_src0_operand"	   "%  v,RD")
+	    (match_operand:DI 2 "gcn_valu_src1com_operand" "vSSB, v")))
+   (use (match_operand:DI 3 "gcn_exec_operand"		   "   e, e"))]
+  "!memory_operand (operands[0], VOIDmode)
+   || (rtx_equal_p (operands[0], operands[1])
+       && register_operand (operands[2], VOIDmode))"
+  "@
+   #
+   ds_<mnemonic>0\t%A0, %2%O0"
+  "(reload_completed && !gcn_ds_memory_operand (operands[0], DImode))"
+  [(parallel [(set (match_dup 4)
+		   (bitop:V64SI (match_dup 6) (match_dup 8)))
+	      (use (match_dup 3))])
+   (parallel [(set (match_dup 5)
+		   (bitop:V64SI (match_dup 7) (match_dup 9)))
+	      (use (match_dup 3))])]
+  {
+    operands[4] = gcn_operand_part (DImode, operands[0], 0);
+    operands[5] = gcn_operand_part (DImode, operands[0], 1);
+    operands[6] = gcn_operand_part (DImode, operands[1], 0);
+    operands[7] = gcn_operand_part (DImode, operands[1], 1);
+    operands[8] = gcn_operand_part (DImode, operands[2], 0);
+    operands[9] = gcn_operand_part (DImode, operands[2], 1);
+  }
+  [(set_attr "type" "vmult,ds")
+   (set_attr "length" "16,8")])
+
+(define_insn "<expander>v64si3_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		  "= v")
+	(vec_merge:V64SI
+	  (shiftop:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand"		  "  v")
+	    (match_operand:SI 2 "gcn_alu_operand"		  "SSB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" " U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "  e")))]
+  ""
+  "v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "v<expander>v64si3_vector"
+  [(set (match_operand:V64SI 0 "register_operand"		  "=v")
+	(vec_merge:V64SI
+	  (shiftop:V64SI
+	    (match_operand:V64SI 1 "gcn_alu_operand"		  " v")
+	    (match_operand:V64SI 2 "gcn_alu_operand"		  "vB"))
+	  (match_operand:V64SI 4 "gcn_register_or_unspec_operand" "U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  " e")))]
+  ""
+  "v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "<expander>v64si3_full"
+  [(set (match_operand:V64SI 0 "register_operand"                "=v,v")
+	(shiftop:V64SI (match_operand:V64SI 1 "register_operand" " v,v")
+		       (match_operand:SI 2 "nonmemory_operand"   "Sg,I")))]
+  ""
+  "@
+   v_<revmnemonic>0\t%0, %2, %1
+   v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "4")
+   (set_attr "exec" "full")])
+
+(define_insn "*<expander>si3_scalar"
+  [(set (match_operand:SI 0 "register_operand"  "=  v")
+	(shiftop:SI
+	  (match_operand:SI 1 "gcn_alu_operand" "   v")
+	  (match_operand:SI 2 "gcn_alu_operand" "vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"  "   e"))]
+  ""
+  "v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "<expander><mode>3_vector"
+  [(set (match_operand:VEC_1REG_INT_MODE 0 "gcn_valu_dst_operand" "=  v,RD")
+	(vec_merge:VEC_1REG_INT_MODE
+	  (minmaxop:VEC_1REG_INT_MODE
+	    (match_operand:VEC_1REG_INT_MODE 1 "gcn_valu_src0_operand"
+								  "%  v, 0")
+	    (match_operand:VEC_1REG_INT_MODE 2 "gcn_valu_src1com_operand"
+								  "vSSB, v"))
+	  (match_operand:VEC_1REG_INT_MODE 4
+	    "gcn_register_ds_or_unspec_operand"			  "  U0,U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		  "   e, e")))]
+  "<MODE>mode != V64QImode
+   && (!memory_operand (operands[0], VOIDmode)
+       || (rtx_equal_p (operands[0], operands[1])
+           && register_operand (operands[2], VOIDmode)))"
+  "@
+   v_<mnemonic>0\t%0, %2, %1
+   ds_<mnemonic>0\t%A0, %2%O0"
+  [(set_attr "type" "vop2,ds")
+   (set_attr "length" "8,8")])
+
+;; }}}
+;; {{{ FP binops - special cases
+
+; GCN does not directly provide a DFmode subtract instruction, so we do it by
+; adding the negated second operand to the first.
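+; That is, a - b is computed as a + (-b) using the source negation
+; modifier (the "-%2" in the templates below) rather than a separate
+; negate instruction.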
+
+(define_insn "subv64df3_vector"
+  [(set (match_operand:V64DF 0 "register_operand"		"=  v,   v")
+	(vec_merge:V64DF
+	  (minus:V64DF
+	    (match_operand:V64DF 1 "gcn_alu_operand"	        "vSSB,   v")
+	    (match_operand:V64DF 2 "gcn_alu_operand"		"   v,vSSB"))
+	  (match_operand:V64DF 4 "gcn_register_or_unspec_operand"
+								"  U0,  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		"   e,   e")))]
+  ""
+  "@
+   v_add_f64\t%0, %1, -%2
+   v_add_f64\t%0, -%2, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8,8")])
+
+(define_insn "subdf_scalar"
+  [(set (match_operand:DF 0 "register_operand"  "=  v,   v")
+	(minus:DF
+	  (match_operand:DF 1 "gcn_alu_operand" "vSSB,   v")
+	  (match_operand:DF 2 "gcn_alu_operand" "   v,vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"  "   e,   e"))]
+  ""
+  "@
+   v_add_f64\t%0, %1, -%2
+   v_add_f64\t%0, -%2, %1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8,8")])
+
+;; }}}
+;; {{{ FP binops - generic
+
+(define_mode_iterator VEC_FP_MODE [V64HF V64SF V64DF])
+(define_mode_iterator VEC_FP_1REG_MODE [V64HF V64SF])
+(define_mode_iterator FP_MODE [HF SF DF])
+(define_mode_iterator FP_1REG_MODE [HF SF])
+
+(define_code_iterator comm_fp [plus mult smin smax])
+(define_code_iterator nocomm_fp [minus])
+(define_code_iterator all_fp [plus mult minus smin smax])
+
+(define_insn "<expander><mode>3_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		     "=  v")
+	(vec_merge:VEC_FP_MODE
+	  (comm_fp:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_alu_operand"	     "%  v")
+	    (match_operand:VEC_FP_MODE 2 "gcn_alu_operand"	     "vSSB"))
+	  (match_operand:VEC_FP_MODE 4 "gcn_register_or_unspec_operand"
+								     "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		     "   e")))]
+  ""
+  "v_<mnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8")])
+
+(define_insn "<expander><mode>3_scalar"
+  [(set (match_operand:FP_MODE 0 "gcn_valu_dst_operand"    "=  v,  RL")
+	(comm_fp:FP_MODE
+	  (match_operand:FP_MODE 1 "gcn_valu_src0_operand" "%  v,   0")
+	  (match_operand:FP_MODE 2 "gcn_valu_src1_operand" "vSSB,vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"             "   e,   e"))]
+  ""
+  "@
+  v_<mnemonic>0\t%0, %2, %1
+  v_<mnemonic>0\t%0, %1%O0"
+  [(set_attr "type" "vop2,ds")
+   (set_attr "length" "8")])
+
+(define_insn "<expander><mode>3_vector"
+  [(set (match_operand:VEC_FP_1REG_MODE 0 "register_operand"    "=  v,   v")
+	(vec_merge:VEC_FP_1REG_MODE
+	  (nocomm_fp:VEC_FP_1REG_MODE
+	    (match_operand:VEC_FP_1REG_MODE 1 "gcn_alu_operand" "vSSB,   v")
+	    (match_operand:VEC_FP_1REG_MODE 2 "gcn_alu_operand" "   v,vSSB"))
+	  (match_operand:VEC_FP_1REG_MODE 4 "gcn_register_or_unspec_operand"
+								"  U0,  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		"   e,   e")))]
+  ""
+  "@
+   v_<mnemonic>0\t%0, %1, %2
+   v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8,8")])
+
+(define_insn "<expander><mode>3_scalar"
+  [(set (match_operand:FP_1REG_MODE 0 "register_operand"  "=  v,   v")
+	(nocomm_fp:FP_1REG_MODE
+	  (match_operand:FP_1REG_MODE 1 "gcn_alu_operand" "vSSB,   v")
+	  (match_operand:FP_1REG_MODE 2 "gcn_alu_operand" "   v,vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"		  "   e,   e"))]
+  ""
+  "@
+   v_<mnemonic>0\t%0, %1, %2
+   v_<revmnemonic>0\t%0, %2, %1"
+  [(set_attr "type" "vop2")
+   (set_attr "length" "8,8")])
+
+(define_expand "<expander><mode>3"
+  [(set (match_operand:VEC_FP_MODE 0 "gcn_valu_dst_operand")
+	(vec_merge:VEC_FP_MODE
+	  (all_fp:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_valu_src0_operand")
+	    (match_operand:VEC_FP_MODE 2 "gcn_valu_src1_operand"))
+	  (match_dup 4)
+	  (match_dup 3)))]
+  ""
+  {
+    operands[3] = gcn_full_exec_reg ();
+    operands[4] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_expand "<expander><mode>3"
+  [(parallel [(set (match_operand:FP_MODE 0 "gcn_valu_dst_operand")
+		   (all_fp:FP_MODE
+		     (match_operand:FP_MODE 1 "gcn_valu_src0_operand")
+		     (match_operand:FP_MODE 2 "gcn_valu_src1_operand")))
+	      (use (match_dup 3))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+;; }}}
+;; {{{ FP unops
+
+(define_insn "abs<mode>2"
+  [(set (match_operand:FP_MODE 0 "register_operand"		 "=v")
+	(abs:FP_MODE (match_operand:FP_MODE 1 "register_operand" " v")))]
+  ""
+  "v_add%i0\t%0, 0, |%1|"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_expand "abs<mode>2"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand")
+	(abs:VEC_FP_MODE (match_operand:VEC_FP_MODE 1 "register_operand")))]
+  ""
+  {
+    emit_insn (gen_abs<mode>2_vector (operands[0], operands[1],
+				      gcn_full_exec_reg (),
+				      gcn_gen_undef (<MODE>mode)));
+    DONE;
+  })
+
+(define_insn "abs<mode>2_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		       "=v")
+	(vec_merge:VEC_FP_MODE
+	  (abs:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "register_operand"	       " v"))
+	  (match_operand:VEC_FP_MODE 3 "gcn_register_or_unspec_operand"
+								       "U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		       " e")))]
+  ""
+  "v_add%i0\t%0, 0, |%1|"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_expand "neg<mode>2"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand")
+	(neg:VEC_FP_MODE (match_operand:VEC_FP_MODE 1 "register_operand")))]
+  ""
+  {
+    emit_insn (gen_neg<mode>2_vector (operands[0], operands[1],
+				      gcn_full_exec_reg (),
+				      gcn_gen_undef (<MODE>mode)));
+    DONE;
+  })
+
+(define_insn "neg<mode>2_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		       "=v")
+	(vec_merge:VEC_FP_MODE
+	  (neg:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "register_operand"	       " v"))
+	  (match_operand:VEC_FP_MODE 3 "gcn_register_or_unspec_operand"
+								       "U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		       " e")))]
+  ""
+  "v_add%i0\t%0, 0, -%1"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "sqrt<mode>_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		     "=  v")
+	(vec_merge:VEC_FP_MODE
+	  (sqrt:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_alu_operand"	     "vSSB"))
+	  (match_operand:VEC_FP_MODE 3 "gcn_register_or_unspec_operand"
+								     "  U0")
+	  (match_operand:DI 2 "gcn_exec_reg_operand"		     "   e")))]
+  "flag_unsafe_math_optimizations"
+  "v_sqrt%i0\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_insn "sqrt<mode>_scalar"
+  [(set (match_operand:FP_MODE 0 "register_operand"  "=  v")
+	(sqrt:FP_MODE
+	  (match_operand:FP_MODE 1 "gcn_alu_operand" "vSSB")))
+   (use (match_operand:DI 2 "gcn_exec_operand"	     "   e"))]
+  "flag_unsafe_math_optimizations"
+  "v_sqrt%i0\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand")
+	(vec_merge:VEC_FP_MODE
+	  (sqrt:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_alu_operand"))
+	  (match_dup 3)
+	  (match_dup 2)))]
+  "flag_unsafe_math_optimizations"
+  {
+    operands[2] = gcn_full_exec_reg ();
+    operands[3] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_expand "sqrt<mode>2"
+  [(parallel [(set (match_operand:FP_MODE 0 "register_operand")
+		   (sqrt:FP_MODE
+		     (match_operand:FP_MODE 1 "gcn_alu_operand")))
+	      (use (match_dup 2))])]
+  "flag_unsafe_math_optimizations"
+  {
+    operands[2] = gcn_scalar_exec ();
+  })
+
+;; }}}
+;; {{{ FP fused multiply and add
+
+(define_insn "fma<mode>_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		"=  v,   v")
+	(vec_merge:VEC_FP_MODE
+	  (fma:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_alu_operand"	"% vA,  vA")
+	    (match_operand:VEC_FP_MODE 2 "gcn_alu_operand"	"  vA,vSSA")
+	    (match_operand:VEC_FP_MODE 3 "gcn_alu_operand"	"vSSA,  vA"))
+	  (match_operand:VEC_FP_MODE 5 "gcn_register_or_unspec_operand"
+								"  U0,  U0")
+	  (match_operand:DI 4 "gcn_exec_reg_operand"		"   e,   e")))]
+  ""
+  "v_fma%i0\t%0, %1, %2, %3"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "fma<mode>_vector_negop2"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"    "=  v,   v,   v")
+	(vec_merge:VEC_FP_MODE
+	  (fma:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_alu_operand" "  vA,  vA,vSSA")
+	    (neg:VEC_FP_MODE
+	      (match_operand:VEC_FP_MODE 2 "gcn_alu_operand"
+							   "  vA,vSSA,  vA"))
+	    (match_operand:VEC_FP_MODE 3 "gcn_alu_operand" "vSSA,  vA,  vA"))
+	  (match_operand:VEC_FP_MODE 5 "gcn_register_or_unspec_operand"
+							   "  U0,  U0,  U0")
+	  (match_operand:DI 4 "gcn_exec_reg_operand"	   "   e,   e,   e")))]
+  ""
+  "v_fma%i0\t%0, %1, -%2, %3"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "fma<mode>_scalar"
+  [(set (match_operand:FP_MODE 0 "register_operand"  "=  v,   v")
+	(fma:FP_MODE
+	  (match_operand:FP_MODE 1 "gcn_alu_operand" "% vA,  vA")
+	  (match_operand:FP_MODE 2 "gcn_alu_operand" "  vA,vSSA")
+	  (match_operand:FP_MODE 3 "gcn_alu_operand" "vSSA,  vA")))
+   (use (match_operand:DI 4 "gcn_exec_operand"	     "   e,   e"))]
+  ""
+  "v_fma%i0\t%0, %1, %2, %3"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_insn "fma<mode>_scalar_negop2"
+  [(set (match_operand:FP_MODE 0 "register_operand"    "=  v,   v,   v")
+	(fma:FP_MODE
+	  (match_operand:FP_MODE 1 "gcn_alu_operand"   "  vA,  vA,vSSA")
+	  (neg:FP_MODE
+	    (match_operand:FP_MODE 2 "gcn_alu_operand" "  vA,vSSA,  vA"))
+	  (match_operand:FP_MODE 3 "gcn_alu_operand"   "vSSA,  vA,  vA")))
+   (use (match_operand:DI 4 "gcn_exec_operand"	       "   e,   e,   e"))]
+  ""
+  "v_fma%i0\t%0, %1, -%2, %3"
+  [(set_attr "type" "vop3a")
+   (set_attr "length" "8")])
+
+(define_expand "fma<mode>4"
+  [(set (match_operand:VEC_FP_MODE 0 "gcn_valu_dst_operand")
+	(vec_merge:VEC_FP_MODE
+	  (fma:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_valu_src1_operand")
+	    (match_operand:VEC_FP_MODE 2 "gcn_valu_src1_operand")
+	    (match_operand:VEC_FP_MODE 3 "gcn_valu_src1_operand"))
+	  (match_dup 5)
+	  (match_dup 4)))]
+  ""
+  {
+    operands[4] = gcn_full_exec_reg ();
+    operands[5] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_expand "fma<mode>4_negop2"
+  [(set (match_operand:VEC_FP_MODE 0 "gcn_valu_dst_operand")
+	(vec_merge:VEC_FP_MODE
+	  (fma:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_valu_src1_operand")
+	    (neg:VEC_FP_MODE
+	      (match_operand:VEC_FP_MODE 2 "gcn_valu_src1_operand"))
+	    (match_operand:VEC_FP_MODE 3 "gcn_valu_src1_operand"))
+	  (match_dup 5)
+	  (match_dup 4)))]
+  ""
+  {
+    operands[4] = gcn_full_exec_reg ();
+    operands[5] = gcn_gen_undef (<MODE>mode);
+  })
+
+(define_expand "fma<mode>4"
+  [(parallel [(set (match_operand:FP_MODE 0 "gcn_valu_dst_operand")
+		   (fma:FP_MODE
+		     (match_operand:FP_MODE 1 "gcn_valu_src1_operand")
+		     (match_operand:FP_MODE 2 "gcn_valu_src1_operand")
+		     (match_operand:FP_MODE 3 "gcn_valu_src1_operand")))
+	      (use (match_dup 4))])]
+  ""
+  {
+    operands[4] = gcn_scalar_exec ();
+  })
+
+(define_expand "fma<mode>4_negop2"
+  [(parallel [(set (match_operand:FP_MODE 0 "gcn_valu_dst_operand")
+		   (fma:FP_MODE
+		     (match_operand:FP_MODE 1 "gcn_valu_src1_operand")
+		     (neg:FP_MODE
+		       (match_operand:FP_MODE 2 "gcn_valu_src1_operand"))
+		     (match_operand:FP_MODE 3 "gcn_valu_src1_operand")))
+	      (use (match_dup 4))])]
+  ""
+  {
+    operands[4] = gcn_scalar_exec ();
+  })
+
+;; }}}
+;; {{{ FP division
+
+(define_insn "recip<mode>_vector"
+  [(set (match_operand:VEC_FP_MODE 0 "register_operand"		     "=  v")
+	(vec_merge:VEC_FP_MODE
+	  (div:VEC_FP_MODE
+	    (match_operand:VEC_FP_MODE 1 "gcn_vec1d_operand"	     "   A")
+	    (match_operand:VEC_FP_MODE 2 "gcn_alu_operand"	     "vSSB"))
+	  (match_operand:VEC_FP_MODE 4 "gcn_register_or_unspec_operand"
+								     "  U0")
+	  (match_operand:DI 3 "gcn_exec_reg_operand"		     "   e")))]
+  ""
+  "v_rcp%i0\t%0, %2"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_insn "recip<mode>_scalar"
+  [(set (match_operand:FP_MODE 0 "register_operand"	 "=  v")
+	(div:FP_MODE
+	  (match_operand:FP_MODE 1 "gcn_const1d_operand" "   A")
+	  (match_operand:FP_MODE 2 "gcn_alu_operand"	 "vSSB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"		 "   e"))]
+  ""
+  "v_rcp%i0\t%0, %2"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+;; Do division via a = b * 1/c
+;; The v_rcp_* instructions are not sufficiently accurate on their own,
+;; so we use two v_fma_* instructions to do one round of Newton-Raphson,
+;; which the ISA manual says is enough to improve the reciprocal accuracy.
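+;;
+;; A sketch of the sequence the expanders below emit (r0 being the initial
+;; v_rcp_* estimate of 1/c):
+;;
+;;   fma = fma (r0, -c, 2)     ; 2 - c*r0, via fma<mode>4_negop2
+;;   rcp = r0 * fma            ; Newton-Raphson step: r1 = r0*(2 - c*r0)
+;;   a   = b * rcp             ; skipped when the division is itself 1/c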
+;;
+;; FIXME: This does not handle denormals, NaNs, division-by-zero etc.
+
+(define_expand "div<mode>3"
+  [(match_operand:VEC_FP_MODE 0 "gcn_valu_dst_operand")
+   (match_operand:VEC_FP_MODE 1 "gcn_valu_src0_operand")
+   (match_operand:VEC_FP_MODE 2 "gcn_valu_src0_operand")]
+  "flag_reciprocal_math"
+  {
+    rtx one = gcn_vec_constant (<MODE>mode,
+		  const_double_from_real_value (dconst1, <SCALAR_MODE>mode));
+    rtx two = gcn_vec_constant (<MODE>mode,
+		  const_double_from_real_value (dconst2, <SCALAR_MODE>mode));
+    rtx initrcp = gen_reg_rtx (<MODE>mode);
+    rtx fma = gen_reg_rtx (<MODE>mode);
+    rtx rcp;
+
+    bool is_rcp = (GET_CODE (operands[1]) == CONST_VECTOR
+		   && real_identical
+		        (CONST_DOUBLE_REAL_VALUE
+			  (CONST_VECTOR_ELT (operands[1], 0)), &dconst1));
+
+    if (is_rcp)
+      rcp = operands[0];
+    else
+      rcp = gen_reg_rtx (<MODE>mode);
+
+    emit_insn (gen_recip<mode>_vector (initrcp, one, operands[2],
+				       gcn_full_exec_reg (),
+				       gcn_gen_undef (<MODE>mode)));
+    emit_insn (gen_fma<mode>4_negop2 (fma, initrcp, operands[2], two));
+    emit_insn (gen_mul<mode>3 (rcp, initrcp, fma));
+
+    if (!is_rcp)
+      emit_insn (gen_mul<mode>3 (operands[0], operands[1], rcp));
+
+    DONE;
+  })
+
+(define_expand "div<mode>3"
+  [(match_operand:FP_MODE 0 "gcn_valu_dst_operand")
+   (match_operand:FP_MODE 1 "gcn_valu_src0_operand")
+   (match_operand:FP_MODE 2 "gcn_valu_src0_operand")]
+  "flag_reciprocal_math"
+  {
+    rtx one = const_double_from_real_value (dconst1, <MODE>mode);
+    rtx two = const_double_from_real_value (dconst2, <MODE>mode);
+    rtx initrcp = gen_reg_rtx (<MODE>mode);
+    rtx fma = gen_reg_rtx (<MODE>mode);
+    rtx rcp;
+
+    bool is_rcp = (GET_CODE (operands[1]) == CONST_DOUBLE
+		   && real_identical (CONST_DOUBLE_REAL_VALUE (operands[1]),
+				      &dconst1));
+
+    if (is_rcp)
+      rcp = operands[0];
+    else
+      rcp = gen_reg_rtx (<MODE>mode);
+
+    emit_insn (gen_recip<mode>_scalar (initrcp, one, operands[2],
+				       gcn_scalar_exec ()));
+    emit_insn (gen_fma<mode>4_negop2 (fma, initrcp, operands[2], two));
+    emit_insn (gen_mul<mode>3 (rcp, initrcp, fma));
+
+    if (!is_rcp)
+      emit_insn (gen_mul<mode>3 (operands[0], operands[1], rcp));
+
+    DONE;
+  })
+
+;; }}}
+;; {{{ Int/FP conversions
+
+(define_mode_iterator CVT_FROM_MODE [HI SI HF SF DF])
+(define_mode_iterator CVT_TO_MODE [HI SI HF SF DF])
+(define_mode_iterator CVT_F_MODE [HF SF DF])
+(define_mode_iterator CVT_I_MODE [HI SI])
+
+(define_mode_iterator VCVT_FROM_MODE [V64HI V64SI V64HF V64SF V64DF])
+(define_mode_iterator VCVT_TO_MODE [V64HI V64SI V64HF V64SF V64DF])
+(define_mode_iterator VCVT_F_MODE [V64HF V64SF V64DF])
+(define_mode_iterator VCVT_I_MODE [V64HI V64SI])
+
+(define_code_iterator cvt_op [fix unsigned_fix
+			      float unsigned_float
+			      float_extend float_truncate])
+(define_code_attr cvt_name [(fix "fix_trunc") (unsigned_fix "fixuns_trunc")
+			    (float "float") (unsigned_float "floatuns")
+			    (float_extend "extend") (float_truncate "trunc")])
+(define_code_attr cvt_operands [(fix "%i0%i1") (unsigned_fix "%u0%i1")
+				(float "%i0%i1") (unsigned_float "%i0%u1")
+				(float_extend "%i0%i1")
+				(float_truncate "%i0%i1")])
+
+(define_expand "<cvt_name><CVT_FROM_MODE:mode><CVT_F_MODE:mode>2"
+  [(parallel [(set (match_operand:CVT_F_MODE 0 "register_operand")
+		   (cvt_op:CVT_F_MODE
+		     (match_operand:CVT_FROM_MODE 1 "gcn_valu_src0_operand")))
+	      (use (match_dup 2))])]
+  "gcn_valid_cvt_p (<CVT_FROM_MODE:MODE>mode, <CVT_F_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  {
+    operands[2] = gcn_scalar_exec ();
+  })
+
+(define_expand "<cvt_name><VCVT_FROM_MODE:mode><VCVT_F_MODE:mode>2"
+  [(set (match_operand:VCVT_F_MODE 0 "register_operand")
+	(vec_merge:VCVT_F_MODE
+	  (cvt_op:VCVT_F_MODE
+	    (match_operand:VCVT_FROM_MODE 1 "gcn_valu_src0_operand"))
+	  (match_dup 3)
+	  (match_dup 2)))]
+  "gcn_valid_cvt_p (<VCVT_FROM_MODE:MODE>mode, <VCVT_F_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  {
+    operands[2] = gcn_full_exec_reg ();
+    operands[3] = gcn_gen_undef (<VCVT_F_MODE:MODE>mode);
+  })
+
+(define_expand "<cvt_name><CVT_F_MODE:mode><CVT_I_MODE:mode>2"
+  [(parallel [(set (match_operand:CVT_I_MODE 0 "register_operand")
+		   (cvt_op:CVT_I_MODE
+		     (match_operand:CVT_F_MODE 1 "gcn_valu_src0_operand")))
+	      (use (match_dup 2))])]
+  "gcn_valid_cvt_p (<CVT_F_MODE:MODE>mode, <CVT_I_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  {
+    operands[2] = gcn_scalar_exec ();
+  })
+
+(define_expand "<cvt_name><VCVT_F_MODE:mode><VCVT_I_MODE:mode>2"
+  [(set (match_operand:VCVT_I_MODE 0 "register_operand")
+	(vec_merge:VCVT_I_MODE
+	  (cvt_op:VCVT_I_MODE
+	    (match_operand:VCVT_F_MODE 1 "gcn_valu_src0_operand"))
+	  (match_dup 3)
+	  (match_dup 2)))]
+  "gcn_valid_cvt_p (<VCVT_F_MODE:MODE>mode, <VCVT_I_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  {
+    operands[2] = gcn_full_exec_reg ();
+    operands[3] = gcn_gen_undef (<VCVT_I_MODE:MODE>mode);
+  })
+
+(define_insn "<cvt_name><CVT_FROM_MODE:mode><CVT_TO_MODE:mode>2_insn"
+  [(set (match_operand:CVT_TO_MODE 0 "register_operand"	   "=  v")
+	(cvt_op:CVT_TO_MODE
+	  (match_operand:CVT_FROM_MODE 1 "gcn_alu_operand" "vSSB")))
+   (use (match_operand:DI 2 "gcn_exec_operand"		   "   e"))]
+  "gcn_valid_cvt_p (<CVT_FROM_MODE:MODE>mode, <CVT_TO_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  "v_cvt<cvt_operands>\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+(define_insn "<cvt_name><VCVT_FROM_MODE:mode><VCVT_TO_MODE:mode>2_insn"
+  [(set (match_operand:VCVT_TO_MODE 0 "register_operand"	    "=  v")
+	(vec_merge:VCVT_TO_MODE
+	  (cvt_op:VCVT_TO_MODE
+	    (match_operand:VCVT_FROM_MODE 1 "gcn_alu_operand"	    "vSSB"))
+	  (match_operand:VCVT_TO_MODE 2 "gcn_alu_or_unspec_operand" "  U0")
+	  (match_operand:DI 3 "gcn_exec_operand"		    "   e")))]
+  "gcn_valid_cvt_p (<VCVT_FROM_MODE:MODE>mode, <VCVT_TO_MODE:MODE>mode,
+		    <cvt_name>_cvt)"
+  "v_cvt<cvt_operands>\t%0, %1"
+  [(set_attr "type" "vop1")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ Int/int conversions
+
+;; GCC can already do these for scalar types, but not for vector types.
+;; Unfortunately you can't just take a SUBREG of a vector to select the low
+;; part, so a few tricks are needed here.
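+;; The trick used below: after reload, the truncation is rewritten as a
+;; plain vec_merge move from the low-part registers of the V64DI input,
+;; obtained via gcn_operand_part.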
+
+(define_insn_and_split "vec_truncatev64div64si"
+  [(set (match_operand:V64SI 0 "register_operand"	     "=v,&v")
+	(vec_merge:V64SI
+	  (truncate:V64SI
+	    (match_operand:V64DI 1 "register_operand"        " 0, v"))
+	  (match_operand:V64SI 2 "gcn_alu_or_unspec_operand" "U0,U0")
+	  (match_operand:DI 3 "gcn_exec_operand"	     " e, e")))]
+  ""
+  "#"
+  "reload_completed"
+  [(parallel [(set (match_dup 0)
+		   (vec_merge:V64SI (match_dup 1) (match_dup 2) (match_dup 3)))
+	      (clobber (scratch:V64DI))])]
+  {
+    operands[1] = gcn_operand_part (V64SImode, operands[1], 0);
+  }
+  [(set_attr "type" "vop2")
+   (set_attr "length" "0,4")])
+
+;; }}}
+;; {{{ Vector comparison/merge
+
+(define_expand "vec_cmp<mode>di"
+  [(parallel
+     [(set (match_operand:DI 0 "register_operand")
+	   (and:DI
+	     (match_operator 1 "comparison_operator"
+	       [(match_operand:VEC_1REG_MODE 2 "gcn_alu_operand")
+		(match_operand:VEC_1REG_MODE 3 "gcn_vop3_operand")])
+	     (match_dup 4)))
+      (clobber (match_scratch:DI 5))])]
+  ""
+  {
+    operands[4] = gcn_full_exec_reg ();
+  })
+
+(define_expand "vec_cmpu<mode>di"
+  [(parallel
+     [(set (match_operand:DI 0 "register_operand")
+	   (and:DI
+	     (match_operator 1 "comparison_operator"
+	       [(match_operand:VEC_1REG_INT_MODE 2 "gcn_alu_operand")
+		(match_operand:VEC_1REG_INT_MODE 3 "gcn_vop3_operand")])
+	     (match_dup 4)))
+      (clobber (match_scratch:DI 5))])]
+  ""
+  {
+    operands[4] = gcn_full_exec_reg ();
+  })
+
+(define_insn "vec_cmp<mode>di_insn"
+  [(set (match_operand:DI 0 "register_operand"	       "=cV,cV,  e, e,Sg,Sg")
+	(and:DI
+	  (match_operator 1 "comparison_operator"
+	    [(match_operand:VEC_1REG_MODE 2 "gcn_alu_operand"
+						       "vSS, B,vSS, B, v,vA")
+	     (match_operand:VEC_1REG_MODE 3 "gcn_vop3_operand"
+						       "  v, v,  v, v,vA, v")])
+	  (match_operand:DI 4 "gcn_exec_reg_operand"   "  e, e,  e, e, e, e")))
+   (clobber (match_scratch:DI 5			       "= X, X, cV,cV, X, X"))]
+  ""
+  "@
+   v_cmp%E1\tvcc, %2, %3
+   v_cmp%E1\tvcc, %2, %3
+   v_cmpx%E1\tvcc, %2, %3
+   v_cmpx%E1\tvcc, %2, %3
+   v_cmp%E1\t%0, %2, %3
+   v_cmp%E1\t%0, %2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vop3a")
+   (set_attr "length" "4,8,4,8,8,8")])
+
+(define_insn "vec_cmp<mode>di_dup"
+  [(set (match_operand:DI 0 "register_operand"		    "=cV,cV, e,e,Sg")
+	(and:DI
+	  (match_operator 1 "comparison_operator"
+	    [(vec_duplicate:VEC_1REG_MODE
+	       (match_operand:<SCALAR_MODE> 2 "gcn_alu_operand"
+							    " SS, B,SS,B, A"))
+	     (match_operand:VEC_1REG_MODE 3 "gcn_vop3_operand"
+							    "  v, v, v,v, v")])
+	  (match_operand:DI 4 "gcn_exec_reg_operand"	    "  e, e, e,e, e")))
+   (clobber (match_scratch:DI 5				    "= X,X,cV,cV, X"))]
+  ""
+  "@
+   v_cmp%E1\tvcc, %2, %3
+   v_cmp%E1\tvcc, %2, %3
+   v_cmpx%E1\tvcc, %2, %3
+   v_cmpx%E1\tvcc, %2, %3
+   v_cmp%E1\t%0, %2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a")
+   (set_attr "length" "4,8,4,8,8")])
+
+(define_expand "vcond_mask_<mode>di"
+  [(parallel
+    [(set (match_operand:VEC_REG_MODE 0 "register_operand" "")
+	  (vec_merge:VEC_REG_MODE
+	    (match_operand:VEC_REG_MODE 1 "gcn_vop3_operand" "")
+	    (match_operand:VEC_REG_MODE 2 "gcn_alu_operand" "")
+	    (match_operand:DI 3 "register_operand" "")))
+     (clobber (scratch:V64DI))])]
+  ""
+  "")
+
+(define_expand "vcond<VEC_1REG_MODE:mode><VEC_1REG_ALT:mode>"
+  [(match_operand:VEC_1REG_MODE 0 "register_operand")
+   (match_operand:VEC_1REG_MODE 1 "gcn_vop3_operand")
+   (match_operand:VEC_1REG_MODE 2 "gcn_alu_operand")
+   (match_operator 3 "comparison_operator"
+     [(match_operand:VEC_1REG_ALT 4 "gcn_alu_operand")
+      (match_operand:VEC_1REG_ALT 5 "gcn_vop3_operand")])]
+  ""
+  {
+    rtx tmp = gen_reg_rtx (DImode);
+    rtx cmp_op = gen_rtx_fmt_ee (GET_CODE (operands[3]), DImode, operands[4],
+				 operands[5]);
+    rtx set = gen_rtx_SET (tmp, gen_rtx_AND (DImode, cmp_op,
+					     gcn_full_exec_reg ()));
+    rtx clobber = gen_rtx_CLOBBER (VOIDmode, gen_rtx_SCRATCH (DImode));
+    emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, set, clobber)));
+    emit_insn (gen_vcond_mask_<mode>di (operands[0], operands[1], operands[2],
+					tmp));
+    DONE;
+  })
+
+
+(define_expand "vcondu<VEC_1REG_INT_MODE:mode><VEC_1REG_INT_ALT:mode>"
+  [(match_operand:VEC_1REG_INT_MODE 0 "register_operand")
+   (match_operand:VEC_1REG_INT_MODE 1 "gcn_vop3_operand")
+   (match_operand:VEC_1REG_INT_MODE 2 "gcn_alu_operand")
+   (match_operator 3 "comparison_operator"
+     [(match_operand:VEC_1REG_INT_ALT 4 "gcn_alu_operand")
+      (match_operand:VEC_1REG_INT_ALT 5 "gcn_vop3_operand")])]
+  ""
+  {
+    rtx tmp = gen_reg_rtx (DImode);
+    rtx cmp_op = gen_rtx_fmt_ee (GET_CODE (operands[3]), DImode, operands[4],
+				 operands[5]);
+    rtx set = gen_rtx_SET (tmp,
+			   gen_rtx_AND (DImode, cmp_op, gcn_full_exec_reg ()));
+    rtx clobber = gen_rtx_CLOBBER (VOIDmode, gen_rtx_SCRATCH (DImode));
+    emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, set, clobber)));
+    emit_insn (gen_vcond_mask_<mode>di (operands[0], operands[1], operands[2],
+				        tmp));
+    DONE;
+  })
+
+;; }}}
+;; {{{ Fully masked loop support
+;;
+;; The autovectorizer requires the mask to be a vector value (we use
+;; V64BImode), but the backend uses a simple DImode for the same thing.
+;;
+;; There are two kinds of patterns here:
+;;
+;; 1) Expanders for masked vector operations (while_ult, maskload, etc.)
+;;
+;; 2) Expanders that convert general V64BImode operations to DImode
+;;    equivalents.
+;;
+(define_expand "while_ultsiv64bi"
+  [(match_operand:V64BI 0 "register_operand")
+   (match_operand:SI 1 "")
+   (match_operand:SI 2 "")]
+  ""
+  {
+    operands[0] = gcn_convert_mask_mode (operands[0]);
+
+    if (GET_CODE (operands[1]) != CONST_INT
+	|| GET_CODE (operands[2]) != CONST_INT)
+      {
+        rtx exec = gcn_full_exec_reg ();
+	rtx _0_1_2_3 = gen_rtx_REG (V64SImode, VGPR_REGNO (1));
+	rtx tmp = _0_1_2_3;
+	if (GET_CODE (operands[1]) != CONST_INT
+	    || INTVAL (operands[1]) != 0)
+	  {
+	    tmp = gen_reg_rtx (V64SImode);
+	    emit_insn (gen_addv64si3_vector_dup (tmp, _0_1_2_3, operands[1],
+						 exec, tmp));
+	  }
+	emit_insn (gen_vec_cmpv64sidi_dup (operands[0],
+					   gen_rtx_GT (VOIDmode, 0, 0),
+					   operands[2], tmp, exec));
+      }
+    else
+      {
+        HOST_WIDE_INT diff = INTVAL (operands[2]) - INTVAL (operands[1]);
+	HOST_WIDE_INT mask = (diff >= 64 ? -1 : ~((HOST_WIDE_INT)-1 << diff));
+        emit_move_insn (operands[0], gen_rtx_CONST_INT (VOIDmode, mask));
+      }
+    DONE;
+  })
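+;; As a worked example of the constant branch above (not part of the
+;; pattern): bounds 0 and 5 give diff == 5, so the mask emitted is
+;; ~(-1 << 5) == 0x1f, i.e. lanes 0-4 active; a diff of 64 or more
+;; yields the all-ones mask -1.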
+
+(define_expand "cstorev64bi4"
+  [(match_operand:BI 0 "gcn_conditional_register_operand")
+   (match_operator:BI 1 "gcn_compare_operator"
+     [(match_operand:V64BI 2 "gcn_alu_operand")
+      (match_operand:V64BI 3 "gcn_alu_operand")])]
+  ""
+  {
+    operands[2] = gcn_convert_mask_mode (operands[2]);
+    operands[3] = gcn_convert_mask_mode (operands[3]);
+
+    emit_insn (gen_cstoredi4 (operands[0], operands[1], operands[2],
+			      operands[3]));
+    DONE;
+  })
+
+(define_expand "cbranchv64bi4"
+  [(match_operator 0 "gcn_compare_operator"
+     [(match_operand:SI 1 "")
+      (match_operand:SI 2 "")])
+   (match_operand 3)]
+  ""
+  {
+    operands[1] = gcn_convert_mask_mode (operands[1]);
+    operands[2] = gcn_convert_mask_mode (operands[2]);
+
+    emit_insn (gen_cbranchdi4 (operands[0], operands[1], operands[2],
+			       operands[3]));
+    DONE;
+  })
+
+(define_expand "movv64bi"
+  [(set (match_operand:V64BI 0 "nonimmediate_operand")
+	(match_operand:V64BI 1 "general_operand"))]
+  ""
+  {
+    operands[0] = gcn_convert_mask_mode (operands[0]);
+    operands[1] = gcn_convert_mask_mode (operands[1]);
+  })
+
+(define_expand "vcond_mask_<mode>v64bi"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand:VEC_REG_MODE 1 "register_operand")
+   (match_operand:VEC_REG_MODE 2 "register_operand")
+   (match_operand:V64BI 3 "register_operand")]
+  ""
+  {
+    operands[3] = gcn_convert_mask_mode (operands[3]);
+
+    emit_insn (gen_vcond_mask_<mode>di (operands[0], operands[1], operands[2],
+					operands[3]));
+    DONE;
+  })
+
+(define_expand "maskload<mode>v64bi"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand:VEC_REG_MODE 1 "memory_operand")
+   (match_operand 2 "")]
+  ""
+  {
+    rtx exec = force_reg (DImode, gcn_convert_mask_mode (operands[2]));
+    rtx addr = gcn_expand_scalar_to_vector_address
+		(<MODE>mode, exec, operands[1], gen_rtx_SCRATCH (V64DImode));
+    rtx as = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
+    rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
+    rtx undef = gcn_gen_undef (<MODE>mode);
+    emit_insn (gen_gather<mode>_expr (operands[0], addr, as, v, undef, exec));
+    DONE;
+  })
+
+(define_expand "maskstore<mode>v64bi"
+  [(match_operand:VEC_REG_MODE 0 "memory_operand")
+   (match_operand:VEC_REG_MODE 1 "register_operand")
+   (match_operand 2 "")]
+  ""
+  {
+    rtx exec = force_reg (DImode, gcn_convert_mask_mode (operands[2]));
+    rtx addr = gcn_expand_scalar_to_vector_address
+		(<MODE>mode, exec, operands[0], gen_rtx_SCRATCH (V64DImode));
+    rtx as = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
+    rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
+    emit_insn (gen_scatter<mode>_expr (addr, operands[1], as, v, exec));
+    DONE;
+  })
+
+(define_expand "mask_gather_load<mode>"
+  [(match_operand:VEC_REG_MODE 0 "register_operand")
+   (match_operand:DI 1 "register_operand")
+   (match_operand 2 "register_operand")
+   (match_operand 3 "immediate_operand")
+   (match_operand:SI 4 "gcn_alu_operand")
+   (match_operand:V64BI 5 "")]
+  ""
+  {
+    rtx exec = force_reg (DImode, gcn_convert_mask_mode (operands[5]));
+
+    /* TODO: more conversions will be needed when more types are vectorized. */
+    if (GET_MODE (operands[2]) == V64DImode)
+      {
+        rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_vec_truncatev64div64si (tmp, operands[2],
+					       gcn_gen_undef (V64SImode),
+					       exec));
+	operands[2] = tmp;
+      }
+
+    emit_insn (gen_gather<mode>_exec (operands[0], operands[1], operands[2],
+				      operands[3], operands[4], exec));
+    DONE;
+  })
+
+(define_expand "mask_scatter_store<mode>"
+  [(match_operand:DI 0 "register_operand")
+   (match_operand 1 "register_operand")
+   (match_operand 2 "immediate_operand")
+   (match_operand:SI 3 "gcn_alu_operand")
+   (match_operand:VEC_REG_MODE 4 "register_operand")
+   (match_operand:V64BI 5 "")]
+  ""
+  {
+    rtx exec = force_reg (DImode, gcn_convert_mask_mode (operands[5]));
+
+    /* TODO: more conversions will be needed when more types are vectorized. */
+    if (GET_MODE (operands[1]) == V64DImode)
+      {
+        rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_vec_truncatev64div64si (tmp, operands[1],
+					       gcn_gen_undef (V64SImode),
+					       exec));
+	operands[1] = tmp;
+      }
+
+    emit_insn (gen_scatter<mode>_exec (operands[0], operands[1], operands[2],
+				       operands[3], operands[4], exec));
+    DONE;
+  })
+
+; FIXME this should be VEC_REG_MODE, but not all dependencies are implemented.
+(define_mode_iterator COND_MODE [V64SI V64DI V64SF V64DF])
+(define_mode_iterator COND_INT_MODE [V64SI V64DI])
+
+(define_code_iterator cond_op [plus minus])
+
+(define_expand "cond_<expander><mode>"
+  [(match_operand:COND_MODE 0 "register_operand")
+   (match_operand:V64BI 1 "register_operand")
+   (cond_op:COND_MODE
+     (match_operand:COND_MODE 2 "gcn_alu_operand")
+     (match_operand:COND_MODE 3 "gcn_alu_operand"))
+   (match_operand:COND_MODE 4 "register_operand")]
+  ""
+  {
+    operands[1] = force_reg (DImode, gcn_convert_mask_mode (operands[1]));
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+
+    emit_insn (gen_<expander><mode>3_vector (operands[0], operands[2],
+					     operands[3], operands[1],
+					     operands[4]));
+    DONE;
+  })
+
+(define_code_iterator cond_bitop [and ior xor])
+
+(define_expand "cond_<expander><mode>"
+  [(match_operand:COND_INT_MODE 0 "register_operand")
+   (match_operand:V64BI 1 "register_operand")
+   (cond_bitop:COND_INT_MODE
+     (match_operand:COND_INT_MODE 2 "gcn_alu_operand")
+     (match_operand:COND_INT_MODE 3 "gcn_alu_operand"))
+   (match_operand:COND_INT_MODE 4 "register_operand")]
+  ""
+  {
+    operands[1] = force_reg (DImode, gcn_convert_mask_mode (operands[1]));
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+
+    emit_insn (gen_<expander><mode>3_vector (operands[0], operands[2],
+					     operands[3], operands[1],
+					     operands[4]));
+    DONE;
+  })
+
+(define_expand "vec_cmp<mode>v64bi"
+  [(match_operand:V64BI 0 "register_operand")
+   (match_operator 1 "comparison_operator"
+     [(match_operand:VEC_1REG_MODE 2 "gcn_alu_operand")
+      (match_operand:VEC_1REG_MODE 3 "gcn_vop3_operand")])]
+  ""
+  {
+    operands[0] = gcn_convert_mask_mode (operands[0]);
+
+    emit_insn (gen_vec_cmp<mode>di (operands[0], operands[1], operands[2],
+				    operands[3]));
+    DONE;
+  })
+
+(define_expand "vec_cmpu<mode>v64bi"
+  [(match_operand:V64BI 0 "register_operand")
+   (match_operator 1 "comparison_operator"
+     [(match_operand:VEC_1REG_INT_MODE 2 "gcn_alu_operand")
+      (match_operand:VEC_1REG_INT_MODE 3 "gcn_vop3_operand")])]
+  ""
+  {
+    operands[0] = gcn_convert_mask_mode (operands[0]);
+
+    emit_insn (gen_vec_cmpu<mode>di (operands[0], operands[1], operands[2],
+				     operands[3]));
+    DONE;
+  })
+
+;; }}}
+;; {{{ Vector reductions
+
+(define_int_iterator REDUC_UNSPEC [UNSPEC_SMIN_DPP_SHR UNSPEC_SMAX_DPP_SHR
+				   UNSPEC_UMIN_DPP_SHR UNSPEC_UMAX_DPP_SHR
+				   UNSPEC_PLUS_DPP_SHR
+				   UNSPEC_AND_DPP_SHR
+				   UNSPEC_IOR_DPP_SHR UNSPEC_XOR_DPP_SHR])
+
+(define_int_iterator REDUC_2REG_UNSPEC [UNSPEC_PLUS_DPP_SHR
+					UNSPEC_AND_DPP_SHR
+					UNSPEC_IOR_DPP_SHR UNSPEC_XOR_DPP_SHR])
+
+; FIXME: Isn't there a better way of doing this?
+(define_int_attr reduc_unspec [(UNSPEC_SMIN_DPP_SHR "UNSPEC_SMIN_DPP_SHR")
+			       (UNSPEC_SMAX_DPP_SHR "UNSPEC_SMAX_DPP_SHR")
+			       (UNSPEC_UMIN_DPP_SHR "UNSPEC_UMIN_DPP_SHR")
+			       (UNSPEC_UMAX_DPP_SHR "UNSPEC_UMAX_DPP_SHR")
+			       (UNSPEC_PLUS_DPP_SHR "UNSPEC_PLUS_DPP_SHR")
+			       (UNSPEC_AND_DPP_SHR "UNSPEC_AND_DPP_SHR")
+			       (UNSPEC_IOR_DPP_SHR "UNSPEC_IOR_DPP_SHR")
+			       (UNSPEC_XOR_DPP_SHR "UNSPEC_XOR_DPP_SHR")])
+
+(define_int_attr reduc_op [(UNSPEC_SMIN_DPP_SHR "smin")
+			   (UNSPEC_SMAX_DPP_SHR "smax")
+			   (UNSPEC_UMIN_DPP_SHR "umin")
+			   (UNSPEC_UMAX_DPP_SHR "umax")
+			   (UNSPEC_PLUS_DPP_SHR "plus")
+			   (UNSPEC_AND_DPP_SHR "and")
+			   (UNSPEC_IOR_DPP_SHR "ior")
+			   (UNSPEC_XOR_DPP_SHR "xor")])
+
+(define_int_attr reduc_insn [(UNSPEC_SMIN_DPP_SHR "v_min%i0")
+			     (UNSPEC_SMAX_DPP_SHR "v_max%i0")
+			     (UNSPEC_UMIN_DPP_SHR "v_min%u0")
+			     (UNSPEC_UMAX_DPP_SHR "v_max%u0")
+			     (UNSPEC_PLUS_DPP_SHR "v_add%u0")
+			     (UNSPEC_AND_DPP_SHR  "v_and%b0")
+			     (UNSPEC_IOR_DPP_SHR  "v_or%b0")
+			     (UNSPEC_XOR_DPP_SHR  "v_xor%b0")])
+
+(define_expand "reduc_<reduc_op>_scal_<mode>"
+  [(set (match_operand:<SCALAR_MODE> 0 "register_operand")
+        (unspec:<SCALAR_MODE>
+	  [(match_operand:VEC_1REG_MODE 1 "register_operand")]
+	  REDUC_UNSPEC))]
+  ""
+  {
+    rtx tmp = gcn_expand_reduc_scalar (<MODE>mode, operands[1],
+				       <reduc_unspec>);
+
+    /* The result of the reduction is in lane 63 of tmp.  */
+    emit_insn (gen_mov_from_lane63_<mode> (operands[0], tmp));
+
+    DONE;
+  })
+
+(define_expand "reduc_<reduc_op>_scal_v64di"
+  [(set (match_operand:DI 0 "register_operand")
+        (unspec:DI
+	  [(match_operand:V64DI 1 "register_operand")]
+	  REDUC_2REG_UNSPEC))]
+  ""
+  {
+    rtx tmp = gcn_expand_reduc_scalar (V64DImode, operands[1],
+				       <reduc_unspec>);
+
+    /* The result of the reduction is in lane 63 of tmp.  */
+    emit_insn (gen_mov_from_lane63_v64di (operands[0], tmp));
+
+    DONE;
+  })
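+;; The expanders above defer to gcn_expand_reduc_scalar, which (assuming
+;; the usual GCN wavefront reduction scheme) is expected to emit
+;; log2(64) == 6 of the DPP shift-and-combine steps below, with shift
+;; amounts 1, 2, 4, 8, 16 and 32, so that lane 63 accumulates the
+;; result for all 64 lanes.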
+
+(define_insn "*<reduc_op>_dpp_shr_<mode>"
+  [(set (match_operand:VEC_1REG_MODE 0 "register_operand"   "=v")
+	(unspec:VEC_1REG_MODE
+	  [(match_operand:VEC_1REG_MODE 1 "register_operand" "v")
+	   (match_operand:VEC_1REG_MODE 2 "register_operand" "v")
+	   (match_operand:SI 3 "const_int_operand"	     "n")]
+	  REDUC_UNSPEC))]
+  "!(TARGET_GCN3 && SCALAR_INT_MODE_P (<SCALAR_MODE>mode)
+     && <reduc_unspec> == UNSPEC_PLUS_DPP_SHR)"
+  {
+    return gcn_expand_dpp_shr_insn (<MODE>mode, "<reduc_insn>",
+				    <reduc_unspec>, INTVAL (operands[3]));
+  }
+  [(set_attr "type" "vop_dpp")
+   (set_attr "exec" "full")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "*<reduc_op>_dpp_shr_v64di"
+  [(set (match_operand:V64DI 0 "register_operand"   "=&v")
+	(unspec:V64DI
+	  [(match_operand:V64DI 1 "register_operand" "v0")
+	   (match_operand:V64DI 2 "register_operand" "v0")
+	   (match_operand:SI 3 "const_int_operand"    "n")]
+	  REDUC_2REG_UNSPEC))]
+  ""
+  "#"
+  "reload_completed"
+  [(set (match_dup 4)
+	(unspec:V64SI
+	  [(match_dup 6) (match_dup 8) (match_dup 3)] REDUC_2REG_UNSPEC))
+   (set (match_dup 5)
+	(unspec:V64SI
+	  [(match_dup 7) (match_dup 9) (match_dup 3)] REDUC_2REG_UNSPEC))]
+  {
+    operands[4] = gcn_operand_part (V64DImode, operands[0], 0);
+    operands[5] = gcn_operand_part (V64DImode, operands[0], 1);
+    operands[6] = gcn_operand_part (V64DImode, operands[1], 0);
+    operands[7] = gcn_operand_part (V64DImode, operands[1], 1);
+    operands[8] = gcn_operand_part (V64DImode, operands[2], 0);
+    operands[9] = gcn_operand_part (V64DImode, operands[2], 1);
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "exec" "full")
+   (set_attr "length" "16")])
+
+; Special cases for addition.
+
+(define_insn "*plus_carry_dpp_shr_<mode>"
+  [(set (match_operand:VEC_1REG_INT_MODE 0 "register_operand"   "=v")
+	(unspec:VEC_1REG_INT_MODE
+	  [(match_operand:VEC_1REG_INT_MODE 1 "register_operand" "v")
+	   (match_operand:VEC_1REG_INT_MODE 2 "register_operand" "v")
+	   (match_operand:SI 3 "const_int_operand"		 "n")]
+	  UNSPEC_PLUS_CARRY_DPP_SHR))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  {
+    const char *insn = TARGET_GCN3 ? "v_add%u0" : "v_add_co%u0";
+    return gcn_expand_dpp_shr_insn (<MODE>mode, insn,
+				    UNSPEC_PLUS_CARRY_DPP_SHR,
+				    INTVAL (operands[3]));
+  }
+  [(set_attr "type" "vop_dpp")
+   (set_attr "exec" "full")
+   (set_attr "length" "8")])
+
+(define_insn "*plus_carry_in_dpp_shr_v64si"
+  [(set (match_operand:V64SI 0 "register_operand"   "=v")
+	(unspec:V64SI
+	  [(match_operand:V64SI 1 "register_operand" "v")
+	   (match_operand:V64SI 2 "register_operand" "v")
+	   (match_operand:SI 3 "const_int_operand"   "n")
+	   (match_operand:DI 4 "register_operand"   "cV")]
+	  UNSPEC_PLUS_CARRY_IN_DPP_SHR))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  {
+    const char *insn = TARGET_GCN3 ? "v_addc%u0" : "v_addc_co%u0";
+    return gcn_expand_dpp_shr_insn (V64SImode, insn,
+				    UNSPEC_PLUS_CARRY_IN_DPP_SHR,
+				    INTVAL (operands[3]));
+  }
+  [(set_attr "type" "vop_dpp")
+   (set_attr "exec" "full")
+   (set_attr "length" "8")])
+
+(define_insn_and_split "*plus_carry_dpp_shr_v64di"
+  [(set (match_operand:V64DI 0 "register_operand"   "=&v")
+	(unspec:V64DI
+	  [(match_operand:V64DI 1 "register_operand" "v0")
+	   (match_operand:V64DI 2 "register_operand" "v0")
+	   (match_operand:SI 3 "const_int_operand"    "n")]
+	  UNSPEC_PLUS_CARRY_DPP_SHR))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "reload_completed"
+  [(parallel [(set (match_dup 4)
+		(unspec:V64SI
+		  [(match_dup 6) (match_dup 8) (match_dup 3)]
+		  UNSPEC_PLUS_CARRY_DPP_SHR))
+	      (clobber (reg:DI VCC_REG))])
+   (parallel [(set (match_dup 5)
+		(unspec:V64SI
+		  [(match_dup 7) (match_dup 9) (match_dup 3) (reg:DI VCC_REG)]
+		  UNSPEC_PLUS_CARRY_IN_DPP_SHR))
+	      (clobber (reg:DI VCC_REG))])]
+  {
+    operands[4] = gcn_operand_part (V64DImode, operands[0], 0);
+    operands[5] = gcn_operand_part (V64DImode, operands[0], 1);
+    operands[6] = gcn_operand_part (V64DImode, operands[1], 0);
+    operands[7] = gcn_operand_part (V64DImode, operands[1], 1);
+    operands[8] = gcn_operand_part (V64DImode, operands[2], 0);
+    operands[9] = gcn_operand_part (V64DImode, operands[2], 1);
+  }
+  [(set_attr "type" "vmult")
+   (set_attr "exec" "full")
+   (set_attr "length" "16")])
+
+; Instructions to move a scalar value from lane 63 of a vector register.
+(define_insn "mov_from_lane63_<mode>"
+  [(set (match_operand:<SCALAR_MODE> 0 "register_operand"  "=Sg,v")
+	(unspec:<SCALAR_MODE>
+	  [(match_operand:VEC_1REG_MODE 1 "register_operand" "v,v")]
+	  UNSPEC_MOV_FROM_LANE63))]
+  ""
+  "@
+   v_readlane_b32\t%0, %1, 63
+   v_mov_b32\t%0, %1 wave_ror:1"
+  [(set_attr "type" "vop3a,vop_dpp")
+   (set_attr "exec" "*,full")
+   (set_attr "length" "8")])
+
+(define_insn "mov_from_lane63_v64di"
+  [(set (match_operand:DI 0 "register_operand"	     "=Sg,v")
+	(unspec:DI
+	  [(match_operand:V64DI 1 "register_operand"   "v,v")]
+	  UNSPEC_MOV_FROM_LANE63))]
+  ""
+  "@
+   v_readlane_b32\t%L0, %L1, 63\;v_readlane_b32\t%H0, %H1, 63
+   * if (REGNO (operands[0]) <= REGNO (operands[1]))	\
+       return \"v_mov_b32\t%L0, %L1 wave_ror:1\;\"	\
+	      \"v_mov_b32\t%H0, %H1 wave_ror:1\";	\
+     else						\
+       return \"v_mov_b32\t%H0, %H1 wave_ror:1\;\"	\
+	      \"v_mov_b32\t%L0, %L1 wave_ror:1\";"
+  [(set_attr "type" "vop3a,vop_dpp")
+   (set_attr "exec" "*,full")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ Miscellaneous
+
+(define_expand "vec_seriesv64si"
+  [(match_operand:V64SI 0 "register_operand")
+   (match_operand:SI 1 "gcn_alu_operand")
+   (match_operand:SI 2 "gcn_alu_operand")]
+  ""
+  {
+    rtx tmp = gen_reg_rtx (V64SImode);
+    rtx v1 = gen_rtx_REG (V64SImode, VGPR_REGNO (1));
+    rtx undef = gcn_gen_undef (V64SImode);
+    rtx exec = gcn_full_exec_reg ();
+
+    emit_insn (gen_mulv64si3_vector_dup (tmp, v1, operands[2], exec, undef));
+    emit_insn (gen_addv64si3_vector_dup (operands[0], tmp, operands[1], exec,
+					 undef));
+    DONE;
+  })
+
+(define_expand "vec_seriesv64di"
+  [(match_operand:V64DI 0 "register_operand")
+   (match_operand:DI 1 "gcn_alu_operand")
+   (match_operand:DI 2 "gcn_alu_operand")]
+  ""
+  {
+    rtx tmp = gen_reg_rtx (V64DImode);
+    rtx v1 = gen_rtx_REG (V64SImode, VGPR_REGNO (1));
+    rtx undef = gcn_gen_undef (V64DImode);
+    rtx exec = gcn_full_exec_reg ();
+
+    emit_insn (gen_mulv64di3_vector_zext_dup2 (tmp, v1, operands[2], exec,
+					       undef));
+    emit_insn (gen_addv64di3_vector_dup (operands[0], tmp, operands[1], exec,
+					 undef));
+    DONE;
+  })
+
+;; }}}

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 05/25] Add sorry_at diagnostic function.
  2018-09-05 13:39   ` David Malcolm
@ 2018-09-05 13:41     ` David Malcolm
  2018-09-11 10:30       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: David Malcolm @ 2018-09-05 13:41 UTC (permalink / raw)
  To: ams, gcc-patches

On Wed, 2018-09-05 at 09:39 -0400, David Malcolm wrote:
> On Wed, 2018-09-05 at 12:49 +0100, ams@codesourcery.com wrote:
> > The plain "sorry" diagnostic only gives the "current" location,
> > which
> > is
> > typically the last line of the function or translation unit by time
> > we get to
> > the back end.
> > 
> > GCN uses "sorry" to report unsupported language features, such as
> > static
> > constructors, so it's useful to have a "sorry_at" variant.
> > 
> > This patch implements "sorry_at" according to the pattern of the
> > other "at"
> > variants.
> > 
> > 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> > 
> > 	gcc/
> > 	* diagnostic-core.h (sorry_at): New prototype.
> > 	* diagnostic.c (sorry_at): New function.
> > ---
> >  gcc/diagnostic-core.h |  1 +
> >  gcc/diagnostic.c      | 11 +++++++++++
> >  2 files changed, 12 insertions(+)
> 
> OK, thanks.

Actually, not quite: "sorry" has changed slightly on trunk since you
copied it.

Please add the:

  auto_diagnostic_group d;

line to the top of the function.

OK with that change.
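
For reference, a sketch of the function with that change (written here
for illustration, modelled on the existing "sorry"; diagnostic_impl and
DK_SORRY are the same internals the other "_at" variants already use):

  void
  sorry_at (location_t loc, const char *gmsgid, ...)
  {
    auto_diagnostic_group d;
    va_list ap;
    va_start (ap, gmsgid);
    rich_location richloc (line_table, loc);
    diagnostic_impl (&richloc, -1, gmsgid, &ap, DK_SORRY);
    va_end (ap);
  }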

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
                   ` (24 preceding siblings ...)
  2018-09-05 13:40 ` [PATCH 21/25] GCN Back-end (part 1/2) Andrew Stubbs
@ 2018-09-05 13:43 ` Andrew Stubbs
  2018-09-05 14:22   ` Joseph Myers
  2018-11-09 19:40   ` Jeff Law
  25 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-05 13:43 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 64 bytes --]

This part initially failed to send due to size.

Here's part 2.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0021-gcn-port-pt2.patch --]
[-- Type: text/x-patch; name="0021-gcn-port-pt2.patch", Size: 307536 bytes --]

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
new file mode 100644
index 0000000..7e59b06
--- /dev/null
+++ b/gcc/config/gcn/gcn.c
@@ -0,0 +1,6161 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* {{{ Includes.  */
+
+/* We want GET_MODE_SIZE et al to return integers, please.  */
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "memmodel.h"
+#include "rtl.h"
+#include "tree.h"
+#include "df.h"
+#include "tm_p.h"
+#include "stringpool.h"
+#include "optabs.h"
+#include "regs.h"
+#include "emit-rtl.h"
+#include "recog.h"
+#include "diagnostic-core.h"
+#include "insn-attr.h"
+#include "fold-const.h"
+#include "calls.h"
+#include "explow.h"
+#include "expr.h"
+#include "output.h"
+#include "cfgrtl.h"
+#include "langhooks.h"
+#include "builtins.h"
+#include "omp-general.h"
+#include "print-rtl.h"
+#include "attribs.h"
+#include "varasm.h"
+
+/* This file should be included last.  */
+#include "target-def.h"
+
+/* }}}  */
+/* {{{ Global variables.  */
+
+/* Constants used by FP instructions.  */
+
+static REAL_VALUE_TYPE dconst4, dconst1over2pi;
+static bool ext_gcn_constants_init = 0;
+
+/* Holds the ISA variant, derived from the command line parameters.  */
+
+int gcn_isa = 3;		/* Default to GCN3.  */
+
+/* Reserve this much space for LDS (for propagating variables from
+   worker-single mode to worker-partitioned mode), per workgroup.  Global
+   analysis could calculate an exact bound, but we don't do that yet.  */
+
+#define LDS_SIZE 32768
+
+/* }}}  */
+/* {{{ Initialization and options.  */
+
+/* Initialize machine_function.  */
+
+static struct machine_function *
+gcn_init_machine_status (void)
+{
+  struct machine_function *f;
+
+  f = ggc_cleared_alloc<machine_function> ();
+
+  /* Set up LDS allocation for broadcasting for this function.  */
+  f->lds_allocated = 32;
+  f->lds_allocs = hash_map<tree, int>::create_ggc (64);
+
+  /* And LDS temporary decls for worker reductions.  */
+  vec_alloc (f->reduc_decls, 0);
+
+  if (TARGET_GCN3)
+    f->use_flat_addressing = true;
+
+  return f;
+}
+
+/* Implement TARGET_OPTION_OVERRIDE.
+ 
+   Override option settings where defaults are variable, or we have specific
+   needs to consider.  */
+
+static void
+gcn_option_override (void)
+{
+  init_machine_status = gcn_init_machine_status;
+
+  /* The HSA runtime does not respect ELF load addresses, so force PIE.  */
+  if (!flag_pie)
+    flag_pie = 2;
+  if (!flag_pic)
+    flag_pic = flag_pie;
+
+  /* Disable debug info, for now.  */
+  debug_info_level = DINFO_LEVEL_NONE;
+
+  gcn_isa = gcn_arch == PROCESSOR_VEGA ? 5 : 3;
+
+  /* The default stack size needs to be small for offload kernels because
+     there may be many, many threads.  But a small stack is insufficient
+     for running the testsuite, so we use a larger default for the
+     stand-alone case.  */
+  if (stack_size_opt == -1)
+    {
+      if (flag_openmp || flag_openacc)
+	/* 1280 bytes per work item = 80kB total.  */
+	stack_size_opt = 1280 * 64;
+      else
+	/* 1MB total.  */
+	stack_size_opt = 1048576;
+    }
+}
+
+/* }}}  */
+/* {{{ Attributes.  */
+
+/* This table defines the arguments that are permitted in
+   __attribute__ ((amdgpu_hsa_kernel (...))).
+
+   The names and values correspond to the HSA metadata that is encoded
+   into the assembler file and binary.  */
+
+static const struct gcn_kernel_arg_type
+{
+  const char *name;
+  const char *header_pseudo;
+  machine_mode mode;
+
+  /* This should be set to -1 or -2 for a dynamically allocated register
+     number.  Use -1 if this argument contributes to the user_sgpr_count,
+     -2 otherwise.  */
+  int fixed_regno;
+} gcn_kernel_arg_types[] = {
+  {"exec", NULL, DImode, EXEC_REG},
+#define PRIVATE_SEGMENT_BUFFER_ARG 1
+  {"private_segment_buffer",
+    "enable_sgpr_private_segment_buffer", TImode, -1},
+#define DISPATCH_PTR_ARG 2
+  {"dispatch_ptr", "enable_sgpr_dispatch_ptr", DImode, -1},
+#define QUEUE_PTR_ARG 3
+  {"queue_ptr", "enable_sgpr_queue_ptr", DImode, -1},
+#define KERNARG_SEGMENT_PTR_ARG 4
+  {"kernarg_segment_ptr", "enable_sgpr_kernarg_segment_ptr", DImode, -1},
+  {"dispatch_id", "enable_sgpr_dispatch_id", DImode, -1},
+#define FLAT_SCRATCH_INIT_ARG 6
+  {"flat_scratch_init", "enable_sgpr_flat_scratch_init", DImode, -1},
+#define FLAT_SCRATCH_SEGMENT_SIZE_ARG 7
+  {"private_segment_size", "enable_sgpr_private_segment_size", SImode, -1},
+  {"grid_workgroup_count_X",
+    "enable_sgpr_grid_workgroup_count_x", SImode, -1},
+  {"grid_workgroup_count_Y",
+    "enable_sgpr_grid_workgroup_count_y", SImode, -1},
+  {"grid_workgroup_count_Z",
+    "enable_sgpr_grid_workgroup_count_z", SImode, -1},
+#define WORKGROUP_ID_X_ARG 11
+  {"workgroup_id_X", "enable_sgpr_workgroup_id_x", SImode, -2},
+  {"workgroup_id_Y", "enable_sgpr_workgroup_id_y", SImode, -2},
+  {"workgroup_id_Z", "enable_sgpr_workgroup_id_z", SImode, -2},
+  {"workgroup_info", "enable_sgpr_workgroup_info", SImode, -1},
+#define PRIVATE_SEGMENT_WAVE_OFFSET_ARG 15
+  {"private_segment_wave_offset",
+    "enable_sgpr_private_segment_wave_byte_offset", SImode, -2},
+#define WORK_ITEM_ID_X_ARG 16
+  {"work_item_id_X", NULL, V64SImode, FIRST_VGPR_REG},
+#define WORK_ITEM_ID_Y_ARG 17
+  {"work_item_id_Y", NULL, V64SImode, FIRST_VGPR_REG + 1},
+#define WORK_ITEM_ID_Z_ARG 18
+  {"work_item_id_Z", NULL, V64SImode, FIRST_VGPR_REG + 2}
+};
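+
+/* For illustration only (a hypothetical declaration, not part of this
+   patch): a kernel entry point requesting some of the arguments above
+   could be written
+
+     void __attribute__ ((amdgpu_hsa_kernel ("private_segment_buffer",
+					      "dispatch_ptr")))
+     my_kernel (void);
+
+   where each string must match a name in gcn_kernel_arg_types.  */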
+
+/* Extract parameter settings from __attribute__((amdgpu_hsa_kernel ())).
+   This function also sets the default values for some arguments.
+ 
+   Return true on error, otherwise ARGS is populated.  */
+
+static bool
+gcn_parse_amdgpu_hsa_kernel_attribute (struct gcn_kernel_args *args,
+				       tree list)
+{
+  bool err = false;
+  args->requested = ((1 << PRIVATE_SEGMENT_BUFFER_ARG)
+		     | (1 << QUEUE_PTR_ARG)
+		     | (1 << KERNARG_SEGMENT_PTR_ARG)
+		     | (1 << PRIVATE_SEGMENT_WAVE_OFFSET_ARG));
+  args->nargs = 0;
+
+  for (int a = 0; a < GCN_KERNEL_ARG_TYPES; a++)
+    args->reg[a] = -1;
+
+  for (; list; list = TREE_CHAIN (list))
+    {
+      const char *str;
+      if (TREE_CODE (TREE_VALUE (list)) != STRING_CST)
+	{
+	  error ("amdgpu_hsa_kernel attribute requires string constant "
+		 "arguments");
+	  break;
+	}
+      str = TREE_STRING_POINTER (TREE_VALUE (list));
+      int a;
+      for (a = 0; a < GCN_KERNEL_ARG_TYPES; a++)
+	{
+	  if (!strcmp (str, gcn_kernel_arg_types[a].name))
+	    break;
+	}
+      if (a == GCN_KERNEL_ARG_TYPES)
+	{
+	  error ("unknown specifier %s in amdgpu_hsa_kernel attribute", str);
+	  err = true;
+	  break;
+	}
+      if (args->requested & (1 << a))
+	{
+	  error ("duplicated parameter specifier %s in amdgpu_hsa_kernel "
+		 "attribute", str);
+	  err = true;
+	  break;
+	}
+      args->requested |= (1 << a);
+      args->order[args->nargs++] = a;
+    }
+  args->requested |= (1 << WORKGROUP_ID_X_ARG);
+  args->requested |= (1 << WORK_ITEM_ID_Z_ARG);
+
+  /* Requesting WORK_ITEM_ID_Z_ARG implies requesting WORK_ITEM_ID_X_ARG and
+     WORK_ITEM_ID_Y_ARG.  Similarly, requesting WORK_ITEM_ID_Y_ARG implies
+     requesting WORK_ITEM_ID_X_ARG.  */
+  if (args->requested & (1 << WORK_ITEM_ID_Z_ARG))
+    args->requested |= (1 << WORK_ITEM_ID_Y_ARG);
+  if (args->requested & (1 << WORK_ITEM_ID_Y_ARG))
+    args->requested |= (1 << WORK_ITEM_ID_X_ARG);
+
+  /* Always enable this so that kernargs is in a predictable place for
+     gomp_print, etc.  */
+  args->requested |= (1 << DISPATCH_PTR_ARG);
+
+  int sgpr_regno = FIRST_SGPR_REG;
+  args->nsgprs = 0;
+  for (int a = 0; a < GCN_KERNEL_ARG_TYPES; a++)
+    {
+      if (!(args->requested & (1 << a)))
+	continue;
+
+      if (gcn_kernel_arg_types[a].fixed_regno >= 0)
+	args->reg[a] = gcn_kernel_arg_types[a].fixed_regno;
+      else
+	{
+	  int reg_count;
+
+	  switch (gcn_kernel_arg_types[a].mode)
+	    {
+	    case E_SImode:
+	      reg_count = 1;
+	      break;
+	    case E_DImode:
+	      reg_count = 2;
+	      break;
+	    case E_TImode:
+	      reg_count = 4;
+	      break;
+	    default:
+	      gcc_unreachable ();
+	    }
+	  args->reg[a] = sgpr_regno;
+	  sgpr_regno += reg_count;
+	  if (gcn_kernel_arg_types[a].fixed_regno == -1)
+	    args->nsgprs += reg_count;
+	}
+    }
+  if (sgpr_regno > FIRST_SGPR_REG + 16)
+    error ("too many arguments passed in sgpr registers");
+  return err;
+}
+
+/* Referenced by TARGET_ATTRIBUTE_TABLE.
+ 
+   Validates target specific attributes.  */
+
+static tree
+gcn_handle_amdgpu_hsa_kernel_attribute (tree *node, tree name,
+					tree args, int, bool *no_add_attrs)
+{
+  if (TREE_CODE (*node) != FUNCTION_TYPE
+      && TREE_CODE (*node) != METHOD_TYPE
+      && TREE_CODE (*node) != FIELD_DECL
+      && TREE_CODE (*node) != TYPE_DECL)
+    {
+      warning (OPT_Wattributes, "%qE attribute only applies to functions",
+	       name);
+      *no_add_attrs = true;
+      return NULL_TREE;
+    }
+
+  /* Validate the argument list against the known kernel parameters.  */
+  if (is_attribute_p ("gcnhsa_kernel", name))
+    {
+      struct gcn_kernel_args kernelarg;
+
+      if (gcn_parse_amdgpu_hsa_kernel_attribute (&kernelarg, args))
+	*no_add_attrs = true;
+
+      return NULL_TREE;
+    }
+
+  return NULL_TREE;
+}
+
+/* Implement TARGET_ATTRIBUTE_TABLE.
+ 
+   Create target-specific __attribute__ types.  */
+
+static const struct attribute_spec gcn_attribute_table[] = {
+  /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
+     affects_type_identity } */
+  {"amdgpu_hsa_kernel", 0, GCN_KERNEL_ARG_TYPES, false, true,
+   true, true, gcn_handle_amdgpu_hsa_kernel_attribute, NULL},
+  /* End element.  */
+  {NULL, 0, 0, false, false, false, false, NULL, NULL}
+};
+
+/* }}}  */
+/* {{{ Registers and modes.  */
+
+/* Implement TARGET_CLASS_MAX_NREGS.
+ 
+   Return the number of hard registers needed to hold a value of MODE in
+   a register of class RCLASS.  */
+
+static unsigned char
+gcn_class_max_nregs (reg_class_t rclass, machine_mode mode)
+{
+  /* Scalar registers are 32bit, vector registers are in fact tuples of
+     64 lanes.  */
+  if (rclass == VGPR_REGS)
+    {
+      if (vgpr_1reg_mode_p (mode))
+	return 1;
+      if (vgpr_2reg_mode_p (mode))
+	return 2;
+      /* TImode is used by DImode compare_and_swap.  */
+      if (mode == TImode)
+	return 4;
+    }
+  return CEIL (GET_MODE_SIZE (mode), 4);
+}
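+
+/* For example, by the rules above: V64SImode needs one VGPR (a VGPR is a
+   tuple of 64 32-bit lanes), V64DImode needs two, and DImode in a scalar
+   register class needs CEIL (8, 4) == 2 consecutive 32-bit registers.  */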
+
+/* Implement TARGET_HARD_REGNO_NREGS.
+   
+   Return the number of hard registers needed to hold a value of MODE in
+   REGNO.  */
+
+unsigned int
+gcn_hard_regno_nregs (unsigned int regno, machine_mode mode)
+{
+  return gcn_class_max_nregs (REGNO_REG_CLASS (regno), mode);
+}
+
+/* Implement TARGET_HARD_REGNO_MODE_OK.
+   
+   Return true if REGNO can hold value in MODE.  */
+
+bool
+gcn_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
+{
+  /* Treat a complex mode as if it were a scalar mode of the same overall
+     size for the purposes of allocating hard registers.  */
+  if (COMPLEX_MODE_P (mode))
+    switch (mode)
+      {
+      case E_CQImode:
+      case E_CHImode:
+	mode = SImode;
+	break;
+      case E_CSImode:
+	mode = DImode;
+	break;
+      case E_CDImode:
+	mode = TImode;
+	break;
+      case E_HCmode:
+	mode = SFmode;
+	break;
+      case E_SCmode:
+	mode = DFmode;
+	break;
+      default:
+	/* Not supported.  */
+	return false;
+      }
+
+  switch (regno)
+    {
+    case FLAT_SCRATCH_LO_REG:
+    case XNACK_MASK_LO_REG:
+    case TBA_LO_REG:
+    case TMA_LO_REG:
+      return (mode == SImode || mode == DImode);
+    case VCC_LO_REG:
+    case EXEC_LO_REG:
+      return (mode == BImode || mode == SImode || mode == DImode
+	      /*|| mode == V32BImode || mode == V64BImode */);
+    case M0_REG:
+    case FLAT_SCRATCH_HI_REG:
+    case XNACK_MASK_HI_REG:
+    case TBA_HI_REG:
+    case TMA_HI_REG:
+      return mode == SImode;
+    case VCC_HI_REG:
+      return false;
+    case EXEC_HI_REG:
+      return mode == SImode /*|| mode == V32BImode */ ;
+    case SCC_REG:
+    case VCCZ_REG:
+    case EXECZ_REG:
+      return mode == BImode;
+    }
+  if (regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM)
+    return true;
+  if (SGPR_REGNO_P (regno))
+    /* We restrict double register values to aligned registers.  */
+    return (sgpr_1reg_mode_p (mode)
+	    || (!((regno - FIRST_SGPR_REG) & 1) && sgpr_2reg_mode_p (mode))
+	    || (((regno - FIRST_SGPR_REG) & 3) == 0 && mode == TImode));
+  if (VGPR_REGNO_P (regno))
+    return (vgpr_1reg_mode_p (mode) || vgpr_2reg_mode_p (mode)
+	    /* TImode is used by DImode compare_and_swap.  */
+	    || mode == TImode);
+  return false;
+}
+
+/* Implement REGNO_REG_CLASS via gcn.h.
+   
+   Return smallest class containing REGNO.  */
+
+enum reg_class
+gcn_regno_reg_class (int regno)
+{
+  switch (regno)
+    {
+    case SCC_REG:
+      return SCC_CONDITIONAL_REG;
+    case VCCZ_REG:
+      return VCCZ_CONDITIONAL_REG;
+    case EXECZ_REG:
+      return EXECZ_CONDITIONAL_REG;
+    case EXEC_LO_REG:
+    case EXEC_HI_REG:
+      return EXEC_MASK_REG;
+    }
+  if (VGPR_REGNO_P (regno))
+    return VGPR_REGS;
+  if (SGPR_REGNO_P (regno))
+    return SGPR_REGS;
+  if (regno < FIRST_VGPR_REG)
+    return GENERAL_REGS;
+  if (regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM)
+    return AFP_REGS;
+  return ALL_REGS;
+}
+
+/* Implement TARGET_CAN_CHANGE_MODE_CLASS.
+   
+   GCC assumes that lowpart contains first part of value as stored in memory.
+   This is not the case for vector registers.  */
+
+bool
+gcn_can_change_mode_class (machine_mode from, machine_mode to,
+			   reg_class_t regclass)
+{
+  if (!vgpr_vector_mode_p (from) && !vgpr_vector_mode_p (to))
+    return true;
+  return (gcn_class_max_nregs (regclass, from)
+	  == gcn_class_max_nregs (regclass, to));
+}
+
+/* Implement TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P.
+   
+   When this hook returns true for MODE, the compiler allows
+   registers explicitly used in the rtl to be used as spill registers
+   but prevents the compiler from extending the lifetime of these
+   registers.  */
+
+bool
+gcn_small_register_classes_for_mode_p (machine_mode mode)
+{
+  /* We allocate into exec and vcc regs, which form small register
+     classes.  */
+  return mode == DImode || mode == SImode;
+}
+
+/* Implement TARGET_CLASS_LIKELY_SPILLED_P.
+ 
+   Returns true if pseudos that have been assigned to registers of class RCLASS
+   would likely be spilled because registers of RCLASS are needed for spill
+   registers.  */
+
+static bool
+gcn_class_likely_spilled_p (reg_class_t rclass)
+{
+  return (rclass == EXEC_MASK_REG
+	  || reg_classes_intersect_p (ALL_CONDITIONAL_REGS, rclass));
+}
+
+/* Implement TARGET_MODES_TIEABLE_P.
+ 
+   Returns true if a value of MODE1 is accessible in MODE2 without
+   copying.  */
+
+bool
+gcn_modes_tieable_p (machine_mode mode1, machine_mode mode2)
+{
+  return (GET_MODE_BITSIZE (mode1) <= MAX_FIXED_MODE_SIZE
+	  && GET_MODE_BITSIZE (mode2) <= MAX_FIXED_MODE_SIZE);
+}
+
+/* Implement TARGET_TRULY_NOOP_TRUNCATION.
+ 
+   Returns true if it is safe to "convert" a value of INPREC bits to one of
+   OUTPREC bits (where OUTPREC is smaller than INPREC) by merely operating on
+   it as if it had only OUTPREC bits.  */
+
+bool
+gcn_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec)
+{
+  return ((inprec <= 32) && (outprec <= inprec));
+}
+
+/* Return N-th part of value occupying multiple registers.  */
+
+rtx
+gcn_operand_part (machine_mode mode, rtx op, int n)
+{
+  if (GET_MODE_SIZE (mode) >= 256)
+    {
+      /*gcc_assert (GET_MODE_SIZE (mode) == 256 || n == 0);  */
+
+      if (REG_P (op))
+	{
+	  gcc_assert (REGNO (op) + n < FIRST_PSEUDO_REGISTER);
+	  return gen_rtx_REG (V64SImode, REGNO (op) + n);
+	}
+      if (GET_CODE (op) == CONST_VECTOR)
+	{
+	  int units = GET_MODE_NUNITS (mode);
+	  rtvec v = rtvec_alloc (units);
+
+	  for (int i = 0; i < units; ++i)
+	    RTVEC_ELT (v, i) = gcn_operand_part (GET_MODE_INNER (mode),
+						 CONST_VECTOR_ELT (op, i), n);
+
+	  return gen_rtx_CONST_VECTOR (V64SImode, v);
+	}
+      if (GET_CODE (op) == UNSPEC && XINT (op, 1) == UNSPEC_VECTOR)
+	return gcn_gen_undef (V64SImode);
+      gcc_unreachable ();
+    }
+  else if (GET_MODE_SIZE (mode) == 8 && REG_P (op))
+    {
+      gcc_assert (REGNO (op) + n < FIRST_PSEUDO_REGISTER);
+      return gen_rtx_REG (SImode, REGNO (op) + n);
+    }
+  else
+    {
+      if (GET_CODE (op) == UNSPEC && XINT (op, 1) == UNSPEC_VECTOR)
+	return gcn_gen_undef (SImode);
+      return simplify_gen_subreg (SImode, op, mode, n * 4);
+    }
+}
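+
+/* For instance, following the code above: part 1 of a V64DImode value
+   held in VGPRs is the V64SImode register one VGPR higher, and part 1 of
+   a DImode value is the SImode register holding its high 32 bits (or a
+   subreg at byte offset 4 for non-register operands).  */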
+
+/* Return N-th part of value occupying multiple registers.  */
+
+rtx
+gcn_operand_doublepart (machine_mode mode, rtx op, int n)
+{
+  return simplify_gen_subreg (DImode, op, mode, n * 8);
+}
+
+/* Return true if OP can be split into subregs or high/low parts.
+   This is always true for scalars, but not normally true for vectors.
+   However, for vectors in hardregs we can use the low and high registers.  */
+
+bool
+gcn_can_split_p (machine_mode, rtx op)
+{
+  if (vgpr_vector_mode_p (GET_MODE (op)))
+    {
+      if (GET_CODE (op) == SUBREG)
+	op = SUBREG_REG (op);
+      if (!REG_P (op))
+	return true;
+      return REGNO (op) <= FIRST_PSEUDO_REGISTER;
+    }
+  return true;
+}
+
+/* Implement TARGET_SPILL_CLASS.
+   
+   Return class of registers which could be used for pseudo of MODE
+   and of class RCLASS for spilling instead of memory.  Return NO_REGS
+   if it is not possible or non-profitable.  */
+
+static reg_class_t
+gcn_spill_class (reg_class_t c, machine_mode /*mode */ )
+{
+  if (reg_classes_intersect_p (ALL_CONDITIONAL_REGS, c))
+    return SGPR_REGS;
+  else
+    return NO_REGS;
+}
+
+/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.
+   
+   Change allocno class for given pseudo from allocno and best class
+   calculated by IRA.  */
+
+static reg_class_t
+gcn_ira_change_pseudo_allocno_class (int regno, reg_class_t cl,
+				     reg_class_t best_cl)
+{
+  /* Avoid returning classes that contain both vgpr and sgpr registers.  */
+  if (cl != ALL_REGS && cl != SRCDST_REGS && cl != ALL_GPR_REGS)
+    return cl;
+  if (best_cl != ALL_REGS && best_cl != SRCDST_REGS
+      && best_cl != ALL_GPR_REGS)
+    return best_cl;
+
+  machine_mode mode = PSEUDO_REGNO_MODE (regno);
+  if (vgpr_vector_mode_p (mode))
+    return VGPR_REGS;
+
+  return GENERAL_REGS;
+}
+
+/* Create a new DImode pseudo reg and emit an instruction to initialize
+   it to VAL.  */
+
+static rtx
+get_exec (int64_t val)
+{
+  rtx reg = gen_reg_rtx (DImode);
+  emit_insn (gen_rtx_SET (reg, gen_int_mode (val, DImode)));
+  return reg;
+}
+
+/* Return value of scalar exec register.  */
+
+rtx
+gcn_scalar_exec ()
+{
+  return const1_rtx;
+}
+
+/* Return pseudo holding scalar exec register.  */
+
+rtx
+gcn_scalar_exec_reg ()
+{
+  return get_exec (1);
+}
+
+/* Return value of full exec register.  */
+
+rtx
+gcn_full_exec ()
+{
+  return constm1_rtx;
+}
+
+/* Return pseudo holding full exec register.  */
+
+rtx
+gcn_full_exec_reg ()
+{
+  return get_exec (-1);
+}
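+
+/* Typical use, as seen in the machine description: expanders pass
+   gcn_full_exec_reg () as the EXEC operand of vector patterns, so each
+   call materializes the all-lanes-active -1 mask in a fresh pseudo.  */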
+
+/* }}}  */
+/* {{{ Immediate constants.  */
+
+/* Initialize shared numeric constants.  */
+
+static void
+init_ext_gcn_constants (void)
+{
+  real_from_integer (&dconst4, DFmode, 4, SIGNED);
+
+  /* FIXME: this constant probably does not match what hardware really loads.
+     Reality check it eventually.  */
+  real_from_string (&dconst1over2pi,
+		    "0.1591549430918953357663423455968866839");
+  real_convert (&dconst1over2pi, SFmode, &dconst1over2pi);
+
+  ext_gcn_constants_init = 1;
+}
+
+/* Return non-zero if X is a constant that can appear as an inline operand.
+   This is 0, 0.5, -0.5, 1, -1, 2, -2, 4, -4 or 1/(2*pi),
+   or a vector of those.
+   The value returned should be the encoding of this constant.  */
+
+int
+gcn_inline_fp_constant_p (rtx x, bool allow_vector)
+{
+  machine_mode mode = GET_MODE (x);
+
+  if ((mode == V64HFmode || mode == V64SFmode || mode == V64DFmode)
+      && allow_vector)
+    {
+      int n;
+      if (GET_CODE (x) != CONST_VECTOR)
+	return 0;
+      n = gcn_inline_fp_constant_p (CONST_VECTOR_ELT (x, 0), false);
+      if (!n)
+	return 0;
+      for (int i = 1; i < 64; i++)
+	if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+	  return 0;
+      return 1;
+    }
+
+  if (mode != HFmode && mode != SFmode && mode != DFmode)
+    return 0;
+
+  const REAL_VALUE_TYPE *r;
+
+  if (x == CONST0_RTX (mode))
+    return 128;
+  if (x == CONST1_RTX (mode))
+    return 242;
+
+  /* dconst4 and dconst1over2pi are initialized lazily.  */
+  if (!ext_gcn_constants_init)
+    init_ext_gcn_constants ();
+
+  r = CONST_DOUBLE_REAL_VALUE (x);
+
+  if (real_identical (r, &dconsthalf))
+    return 240;
+  if (real_identical (r, &dconstm1))
+    return 243;
+  if (real_identical (r, &dconst2))
+    return 244;
+  if (real_identical (r, &dconst4))
+    return 246;
+  if (real_identical (r, &dconst1over2pi))
+    return 248;
+
+  /* real_value_negate returns the negated value; it does not modify *R.  */
+  REAL_VALUE_TYPE rneg = real_value_negate (r);
+  if (real_identical (&rneg, &dconsthalf))
+    return 241;
+  if (real_identical (&rneg, &dconst2))
+    return 245;
+  if (real_identical (&rneg, &dconst4))
+    return 247;
+
+  return 0;
+}
+
+/* Return true if X is a constant that can appear as an immediate operand,
+   either one of the inline constants above or a literal 32-bit
+   immediate, or a vector of those.  */
+
+bool
+gcn_fp_constant_p (rtx x, bool allow_vector)
+{
+  machine_mode mode = GET_MODE (x);
+
+  if ((mode == V64HFmode || mode == V64SFmode || mode == V64DFmode)
+      && allow_vector)
+    {
+      int n;
+      if (GET_CODE (x) != CONST_VECTOR)
+	return false;
+      n = gcn_fp_constant_p (CONST_VECTOR_ELT (x, 0), false);
+      if (!n)
+	return false;
+      for (int i = 1; i < 64; i++)
+	if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+	  return false;
+      return true;
+    }
+  if (mode != HFmode && mode != SFmode && mode != DFmode)
+    return false;
+
+  if (gcn_inline_fp_constant_p (x, false))
+    return true;
+  /* FIXME: It is not clear how 32bit immediates are interpreted here.  */
+  return (mode != DFmode);
+}
+
+/* Return true if X is a constant representable as an inline immediate
+   constant in a 32-bit instruction encoding.  */
+
+bool
+gcn_inline_constant_p (rtx x)
+{
+  if (GET_CODE (x) == CONST_INT)
+    return INTVAL (x) >= -16 && INTVAL (x) < 64;
+  if (GET_CODE (x) == CONST_DOUBLE)
+    return gcn_inline_fp_constant_p (x, false);
+  if (GET_CODE (x) == CONST_VECTOR)
+    {
+      int n;
+      if (!vgpr_vector_mode_p (GET_MODE (x))
+	  && GET_MODE (x) != V64BImode)
+	return false;
+      n = gcn_inline_constant_p (CONST_VECTOR_ELT (x, 0));
+      if (!n)
+	return false;
+      for (int i = 1; i < 64; i++)
+	if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+	  return false;
+      return 1;
+    }
+  return false;
+}
+
+/* Return true if X is a constant representable as an immediate constant
+   in a 32 or 64-bit instruction encoding.  */
+
+bool
+gcn_constant_p (rtx x)
+{
+  switch (GET_CODE (x))
+    {
+    case CONST_INT:
+      return true;
+
+    case CONST_DOUBLE:
+      return gcn_fp_constant_p (x, false);
+
+    case CONST_VECTOR:
+      {
+	int n;
+	if (!vgpr_vector_mode_p (GET_MODE (x))
+	    && GET_MODE (x) != V64BImode)
+	  return false;
+	n = gcn_constant_p (CONST_VECTOR_ELT (x, 0));
+	if (!n)
+	  return false;
+	for (int i = 1; i < 64; i++)
+	  if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+	    return false;
+	return true;
+      }
+
+    case SYMBOL_REF:
+    case LABEL_REF:
+      return true;
+
+    default:
+      ;
+    }
+
+  return false;
+}
+
+/* Return true if X is a constant representable as two inline immediate
+   constants in a 64-bit instruction that is split into two 32-bit
+   instructions.  */
+
+bool
+gcn_inline_constant64_p (rtx x)
+{
+  machine_mode mode;
+
+  if (GET_CODE (x) == CONST_VECTOR)
+    {
+      int n;
+      if (!vgpr_vector_mode_p (GET_MODE (x))
+	  && GET_MODE (x) != V64BImode)
+	return false;
+      if (!gcn_inline_constant64_p (CONST_VECTOR_ELT (x, 0)))
+	return false;
+      for (int i = 1; i < 64; i++)
+	if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+	  return false;
+
+      return true;
+    }
+
+  if (GET_CODE (x) != CONST_INT)
+    return false;
+
+  rtx val_lo = gcn_operand_part (DImode, x, 0);
+  rtx val_hi = gcn_operand_part (DImode, x, 1);
+  return gcn_inline_constant_p (val_lo) && gcn_inline_constant_p (val_hi);
+}
+
+/* Return true if X is a constant representable as an immediate constant
+   in a 32 or 64-bit instruction encoding where the hardware will
+   extend the immediate to 64-bits.  */
+
+bool
+gcn_constant64_p (rtx x)
+{
+  if (!gcn_constant_p (x))
+    return false;
+
+  if (GET_CODE (x) != CONST_INT)
+    return true;
+
+  /* Negative numbers are only allowed if they can be encoded within src0,
+     because the 32-bit immediates do not get sign-extended.
+     Unsigned numbers must not be encodable as 32-bit -1..-16, because the
+     assembler will use a src0 inline immediate and that will get
+     sign-extended.  */
+  HOST_WIDE_INT val = INTVAL (x);
+  return (((val & 0xffffffff) == val	/* Positive 32-bit.  */
+	   && (val & 0xfffffff0) != 0xfffffff0)	/* Not -1..-16.  */
+	  || gcn_inline_constant_p (x));	/* Src0.  */
+}
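+
+/* A worked example of the rule above: 0x00000000fffffff0 is rejected,
+   because as a 32-bit immediate the assembler would encode it as the
+   inline constant -16, which the hardware sign-extends to
+   0xfffffffffffffff0.  */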
+
+/* Implement TARGET_LEGITIMATE_CONSTANT_P.
+ 
+   Returns true if X is a legitimate constant for a MODE immediate operand.  */
+
+bool
+gcn_legitimate_constant_p (machine_mode, rtx x)
+{
+  return gcn_constant_p (x);
+}
+
+/* Return true if X is a CONST_VECTOR of a single, duplicated constant.  */
+
+static bool
+single_cst_vector_p (rtx x)
+{
+  if (GET_CODE (x) != CONST_VECTOR)
+    return false;
+  for (int i = 1; i < 64; i++)
+    if (CONST_VECTOR_ELT (x, i) != CONST_VECTOR_ELT (x, 0))
+      return false;
+  return true;
+}
+
+/* Create a CONST_VECTOR of duplicated value A.  */
+
+rtx
+gcn_vec_constant (machine_mode mode, int a)
+{
+  /*if (!a)
+    return CONST0_RTX (mode);
+  if (a == -1)
+    return CONSTM1_RTX (mode);
+  if (a == 1)
+    return CONST1_RTX (mode);
+  if (a == 2)
+    return CONST2_RTX (mode);*/
+
+  int units = GET_MODE_NUNITS (mode);
+  rtx tem = gen_int_mode (a, GET_MODE_INNER (mode));
+  rtvec v = rtvec_alloc (units);
+
+  for (int i = 0; i < units; ++i)
+    RTVEC_ELT (v, i) = tem;
+
+  return gen_rtx_CONST_VECTOR (mode, v);
+}
+
+/* Create a CONST_VECTOR of duplicated value A.  */
+
+rtx
+gcn_vec_constant (machine_mode mode, rtx a)
+{
+  int units = GET_MODE_NUNITS (mode);
+  rtvec v = rtvec_alloc (units);
+
+  for (int i = 0; i < units; ++i)
+    RTVEC_ELT (v, i) = a;
+
+  return gen_rtx_CONST_VECTOR (mode, v);
+}
+
+/* Create an undefined vector value, used where an insn operand is
+   optional.  */
+
+rtx
+gcn_gen_undef (machine_mode mode)
+{
+  return gen_rtx_UNSPEC (mode, gen_rtvec (1, const0_rtx), UNSPEC_VECTOR);
+}
+
+/* }}}  */
+/* {{{ Addresses, pointers and moves.  */
+
+/* Return true if REG is a valid place to store a pointer,
+   for instructions that require an SGPR.
+   FIXME rename. */
+
+static bool
+gcn_address_register_p (rtx reg, machine_mode mode, bool strict)
+{
+  if (GET_CODE (reg) == SUBREG)
+    reg = SUBREG_REG (reg);
+
+  if (!REG_P (reg))
+    return false;
+
+  if (GET_MODE (reg) != mode)
+    return false;
+
+  int regno = REGNO (reg);
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      if (!strict)
+	return true;
+
+      if (!reg_renumber)
+	return false;
+
+      regno = reg_renumber[regno];
+    }
+
+  return (regno < 102 || regno == M0_REG
+	  || regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM);
+}
+
+/* Return true if REG is a valid place to store a pointer,
+   for instructions that require a VGPR.  */
+
+static bool
+gcn_vec_address_register_p (rtx reg, machine_mode mode, bool strict)
+{
+  if (GET_CODE (reg) == SUBREG)
+    reg = SUBREG_REG (reg);
+
+  if (!REG_P (reg))
+    return false;
+
+  if (GET_MODE (reg) != mode)
+    return false;
+
+  int regno = REGNO (reg);
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      if (!strict)
+	return true;
+
+      if (!reg_renumber)
+	return false;
+
+      regno = reg_renumber[regno];
+    }
+
+  return VGPR_REGNO_P (regno);
+}
+
+/* Return true if X would be valid inside a MEM using the Flat address
+   space.  */
+
+bool
+gcn_flat_address_p (rtx x, machine_mode mode)
+{
+  bool vec_mode = (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+		   || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT);
+
+  if (vec_mode && gcn_address_register_p (x, DImode, false))
+    return true;
+
+  if (!vec_mode && gcn_vec_address_register_p (x, DImode, false))
+    return true;
+
+  if (TARGET_GCN5_PLUS
+      && GET_CODE (x) == PLUS
+      && gcn_vec_address_register_p (XEXP (x, 0), DImode, false)
+      && CONST_INT_P (XEXP (x, 1)))
+    return true;
+
+  return false;
+}
+
+/* Return true if X would be valid inside a MEM using the Scalar Flat
+   address space.  */
+
+bool
+gcn_scalar_flat_address_p (rtx x)
+{
+  if (gcn_address_register_p (x, DImode, false))
+    return true;
+
+  if (GET_CODE (x) == PLUS
+      && gcn_address_register_p (XEXP (x, 0), DImode, false)
+      && CONST_INT_P (XEXP (x, 1)))
+    return true;
+
+  return false;
+}
+
+/* Return true if MEM X would be valid for the Scalar Flat address space.  */
+
+bool
+gcn_scalar_flat_mem_p (rtx x)
+{
+  if (!MEM_P (x))
+    return false;
+
+  if (GET_MODE_SIZE (GET_MODE (x)) < 4)
+    return false;
+
+  return gcn_scalar_flat_address_p (XEXP (x, 0));
+}
+
+/* Return true if X would be valid inside a MEM using the LDS or GDS
+   address spaces.  */
+
+bool
+gcn_ds_address_p (rtx x)
+{
+  if (gcn_vec_address_register_p (x, SImode, false))
+    return true;
+
+  if (GET_CODE (x) == PLUS
+      && gcn_vec_address_register_p (XEXP (x, 0), SImode, false)
+      && CONST_INT_P (XEXP (x, 1)))
+    return true;
+
+  return false;
+}
+
+/* Return true if ADDR would be valid inside a MEM using the Global
+   address space.  */
+
+bool
+gcn_global_address_p (rtx addr)
+{
+  if (gcn_address_register_p (addr, DImode, false)
+      || gcn_vec_address_register_p (addr, DImode, false))
+    return true;
+
+  if (GET_CODE (addr) == PLUS)
+    {
+      rtx base = XEXP (addr, 0);
+      rtx offset = XEXP (addr, 1);
+      bool immediate_p = (CONST_INT_P (offset)
+			  && INTVAL (offset) >= -(1 << 12)
+			  && INTVAL (offset) < (1 << 12));
+
+      if ((gcn_address_register_p (base, DImode, false)
+	   || gcn_vec_address_register_p (base, DImode, false))
+	  && immediate_p)
+	/* SGPR + CONST or VGPR + CONST  */
+	return true;
+
+      if (gcn_address_register_p (base, DImode, false)
+	  && gcn_vgpr_register_operand (offset, SImode))
+	/* SPGR + VGPR  */
+	return true;
+
+      if (GET_CODE (base) == PLUS
+	  && gcn_address_register_p (XEXP (base, 0), DImode, false)
+	  && gcn_vgpr_register_operand (XEXP (base, 1), SImode)
+	  && immediate_p)
+	/* (SGPR + VGPR) + CONST  */
+	return true;
+    }
+
+  return false;
+}
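+
+/* So, for instance, an address accepted above may look like
+     (plus:DI (plus:DI (reg:DI sgpr) (reg:SI vgpr)) (const_int N))
+   with -4096 <= N < 4096: a scalar base, a per-lane offset and a small
+   immediate.  (The SImode VGPR inside a DImode PLUS mirrors the checks
+   above.)  */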
+
+/* Implement TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P.
+   
+   Recognizes RTL expressions that are valid memory addresses for an
+   instruction.  The MODE argument is the machine mode for the MEM
+   expression that wants to use this address.
+
+   It only recognizes address in canonical form.  LEGITIMIZE_ADDRESS should
+   convert common non-canonical forms to canonical form so that they will
+   be recognized.  */
+
+static bool
+gcn_addr_space_legitimate_address_p (machine_mode mode, rtx x, bool strict,
+				     addr_space_t as)
+{
+  /* All vector instructions need to work on addresses in registers.  */
+  if (!TARGET_GCN5_PLUS && (vgpr_vector_mode_p (mode) && !REG_P (x)))
+    return false;
+
+  if (AS_SCALAR_FLAT_P (as))
+    switch (GET_CODE (x))
+      {
+      case REG:
+	return gcn_address_register_p (x, DImode, strict);
+      /* Addresses are in the form BASE+OFFSET.
+	 OFFSET is either a 20-bit unsigned immediate, an SGPR, or M0.
+	 Writes and atomics do not accept SGPR.  */
+      case PLUS:
+	{
+	  rtx x0 = XEXP (x, 0);
+	  rtx x1 = XEXP (x, 1);
+	  if (!gcn_address_register_p (x0, DImode, strict))
+	    return false;
+	  /* FIXME: This is disabled because of the mode mismatch between
+	     SImode (for the address or m0 register) and the DImode PLUS.
+	     We'll need a zero_extend or similar.
+
+	  if (gcn_m0_register_p (x1, SImode, strict)
+	      || gcn_address_register_p (x1, SImode, strict))
+	    return true;
+	  else*/
+	  if (GET_CODE (x1) == CONST_INT)
+	    {
+	      if (INTVAL (x1) >= 0 && INTVAL (x1) < (1 << 20)
+		  /* The low bits of the offset are ignored, even when
+		     they're meant to realign the pointer.  */
+		  && !(INTVAL (x1) & 0x3))
+		return true;
+	    }
+	  return false;
+	}
+
+      default:
+	break;
+      }
+  else if (AS_SCRATCH_P (as))
+    return gcn_address_register_p (x, SImode, strict);
+  else if (AS_FLAT_P (as) || AS_FLAT_SCRATCH_P (as))
+    {
+      if (TARGET_GCN3 || GET_CODE (x) == REG)
+       return ((GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+		|| GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
+	       ? gcn_address_register_p (x, DImode, strict)
+	       : gcn_vec_address_register_p (x, DImode, strict));
+      else
+	{
+	  gcc_assert (TARGET_GCN5_PLUS);
+
+	  if (GET_CODE (x) == PLUS)
+	    {
+	      rtx x1 = XEXP (x, 1);
+
+	      if (VECTOR_MODE_P (mode)
+		  ? !gcn_address_register_p (x, DImode, strict)
+		  : !gcn_vec_address_register_p (x, DImode, strict))
+		return false;
+
+	      if (GET_CODE (x1) == CONST_INT)
+		{
+		  if (INTVAL (x1) >= 0 && INTVAL (x1) < (1 << 12)
+		      /* The low bits of the offset are ignored, even when
+		         they're meant to realign the pointer.  */
+		      && !(INTVAL (x1) & 0x3))
+		    return true;
+		}
+	    }
+	  return false;
+	}
+    }
+  else if (AS_GLOBAL_P (as))
+    {
+      gcc_assert (TARGET_GCN5_PLUS);
+
+      if (GET_CODE (x) == REG)
+       return (gcn_address_register_p (x, DImode, strict)
+	       || (!VECTOR_MODE_P (mode)
+		   && gcn_vec_address_register_p (x, DImode, strict)));
+      else if (GET_CODE (x) == PLUS)
+	{
+	  rtx base = XEXP (x, 0);
+	  rtx offset = XEXP (x, 1);
+
+	  bool immediate_p = (GET_CODE (offset) == CONST_INT
+			      /* Signed 13-bit immediate.  */
+			      && INTVAL (offset) >= -(1 << 12)
+			      && INTVAL (offset) < (1 << 12)
+			      /* The low bits of the offset are ignored, even
+			         when they're meant to realign the pointer.  */
+			      && !(INTVAL (offset) & 0x3));
+
+	  if (!VECTOR_MODE_P (mode))
+	    {
+	      if ((gcn_address_register_p (base, DImode, strict)
+		   || gcn_vec_address_register_p (base, DImode, strict))
+		  && immediate_p)
+		/* SGPR + CONST or VGPR + CONST  */
+		return true;
+
+	      if (gcn_address_register_p (base, DImode, strict)
+		  && gcn_vgpr_register_operand (offset, SImode))
+		/* SGPR + VGPR  */
+		return true;
+
+	      if (GET_CODE (base) == PLUS
+		  && gcn_address_register_p (XEXP (base, 0), DImode, strict)
+		  && gcn_vgpr_register_operand (XEXP (base, 1), SImode)
+		  && immediate_p)
+		/* (SGPR + VGPR) + CONST  */
+		return true;
+	    }
+	  else
+	    {
+	      if (gcn_address_register_p (base, DImode, strict)
+		  && immediate_p)
+		/* SGPR + CONST  */
+		return true;
+	    }
+	}
+      else
+	return false;
+    }
+  else if (AS_ANY_DS_P (as))
+    switch (GET_CODE (x))
+      {
+      case REG:
+	return (VECTOR_MODE_P (mode)
+		? gcn_address_register_p (x, SImode, strict)
+		: gcn_vec_address_register_p (x, SImode, strict));
+      /* Addresses are in the form BASE+OFFSET, where OFFSET is either
+	 a 20-bit unsigned immediate, an SGPR, or M0.
+	 Writes and atomics do not accept SGPR.  */
+      case PLUS:
+	{
+	  rtx x0 = XEXP (x, 0);
+	  rtx x1 = XEXP (x, 1);
+	  if (!gcn_vec_address_register_p (x0, DImode, strict))
+	    return false;
+	  if (GET_CODE (x1) == REG)
+	    {
+	      if (REGNO (x1) <= FIRST_PSEUDO_REGISTER
+		  && !gcn_ssrc_register_operand (x1, DImode))
+		return false;
+	    }
+	  else if (GET_CODE (x1) == CONST_VECTOR
+		   && GET_CODE (CONST_VECTOR_ELT (x1, 0)) == CONST_INT
+		   && single_cst_vector_p (x1))
+	    {
+	      x1 = CONST_VECTOR_ELT (x1, 0);
+	      if (INTVAL (x1) >= 0 && INTVAL (x1) < (1 << 20))
+		return true;
+	    }
+	  return false;
+	}
+
+      default:
+	break;
+      }
+  else
+    gcc_unreachable ();
+  return false;
+}
+
+/* Implement TARGET_ADDR_SPACE_POINTER_MODE.
+   
+   Return the appropriate mode for a named address pointer.  */
+
+static scalar_int_mode
+gcn_addr_space_pointer_mode (addr_space_t addrspace)
+{
+  switch (addrspace)
+    {
+    case ADDR_SPACE_SCRATCH:
+    case ADDR_SPACE_LDS:
+    case ADDR_SPACE_GDS:
+      return SImode;
+    case ADDR_SPACE_DEFAULT:
+    case ADDR_SPACE_FLAT:
+    case ADDR_SPACE_FLAT_SCRATCH:
+    case ADDR_SPACE_SCALAR_FLAT:
+      return DImode;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Implement TARGET_ADDR_SPACE_ADDRESS_MODE.
+   
+   Return the appropriate mode for a named address space address.  */
+
+static scalar_int_mode
+gcn_addr_space_address_mode (addr_space_t addrspace)
+{
+  return gcn_addr_space_pointer_mode (addrspace);
+}
+
+/* Implement TARGET_ADDR_SPACE_SUBSET_P.
+   
+   Determine if one named address space is a subset of another.  */
+
+static bool
+gcn_addr_space_subset_p (addr_space_t subset, addr_space_t superset)
+{
+  if (subset == superset)
+    return true;
+  /* FIXME: is this true?  */
+  if (AS_FLAT_P (superset) || AS_SCALAR_FLAT_P (superset))
+    return true;
+  return false;
+}
+
+/* Implement TARGET_ADDR_SPACE_CONVERT.
+
+   Convert from one address space to another.  */
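+
+/* For illustration (informal): converting an LDS offset 0x1234 to FLAT
+   yields a DImode pointer whose low half is 0x1234 and whose high half
+   is the group-segment aperture word read from the HSA queue object at
+   byte offset 64, as implemented below.  */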
+
+static rtx
+gcn_addr_space_convert (rtx op, tree from_type, tree to_type)
+{
+  gcc_assert (POINTER_TYPE_P (from_type));
+  gcc_assert (POINTER_TYPE_P (to_type));
+
+  addr_space_t as_from = TYPE_ADDR_SPACE (TREE_TYPE (from_type));
+  addr_space_t as_to = TYPE_ADDR_SPACE (TREE_TYPE (to_type));
+
+  if (AS_LDS_P (as_from) && AS_FLAT_P (as_to))
+    {
+      rtx queue = gen_rtx_REG (DImode,
+			       cfun->machine->args.reg[QUEUE_PTR_ARG]);
+      rtx group_seg_aperture_hi = gen_rtx_MEM (SImode,
+				     gen_rtx_PLUS (DImode, queue,
+						   gen_int_mode (64, SImode)));
+      rtx tmp = gen_reg_rtx (DImode);
+
+      emit_move_insn (gen_lowpart (SImode, tmp), op);
+      emit_move_insn (gen_highpart_mode (SImode, DImode, tmp),
+		      group_seg_aperture_hi);
+
+      return tmp;
+    }
+  else if (as_from == as_to)
+    return op;
+  else
+    gcc_unreachable ();
+}
+
+
+/* Implement REGNO_MODE_CODE_OK_FOR_BASE_P via gcn.h
+   
+   Return true if REGNO is OK for memory addressing.  */
+
+bool
+gcn_regno_mode_code_ok_for_base_p (int regno,
+				   machine_mode, addr_space_t as, int, int)
+{
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      if (reg_renumber)
+	regno = reg_renumber[regno];
+      else
+	return true;
+    }
+  if (AS_FLAT_P (as))
+    return (VGPR_REGNO_P (regno)
+	    || regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM);
+  else if (AS_SCALAR_FLAT_P (as))
+    return (SGPR_REGNO_P (regno)
+	    || regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM);
+  else if (AS_GLOBAL_P (as))
+    {
+      return (SGPR_REGNO_P (regno)
+	      || VGPR_REGNO_P (regno)
+	      || regno == ARG_POINTER_REGNUM
+	      || regno == FRAME_POINTER_REGNUM);
+    }
+  else
+    /* For now.  */
+    return false;
+}
+
+/* Implement MODE_CODE_BASE_REG_CLASS via gcn.h.
+   
+   Return a suitable register class for memory addressing.  */
+
+reg_class
+gcn_mode_code_base_reg_class (machine_mode mode, addr_space_t as, int oc,
+			      int ic)
+{
+  switch (as)
+    {
+    case ADDR_SPACE_DEFAULT:
+      return gcn_mode_code_base_reg_class (mode, DEFAULT_ADDR_SPACE, oc, ic);
+    case ADDR_SPACE_SCALAR_FLAT:
+    case ADDR_SPACE_SCRATCH:
+      return SGPR_REGS;
+      break;
+    case ADDR_SPACE_FLAT:
+    case ADDR_SPACE_FLAT_SCRATCH:
+    case ADDR_SPACE_LDS:
+    case ADDR_SPACE_GDS:
+      return ((GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+	       || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
+	      ? SGPR_REGS : VGPR_REGS);
+    case ADDR_SPACE_GLOBAL:
+      return ((GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+	       || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
+	      ? SGPR_REGS : ALL_GPR_REGS);
+    }
+  gcc_unreachable ();
+}
+
+/* Implement REGNO_OK_FOR_INDEX_P via gcn.h.
+   
+   Return true if REGNO is OK for index of memory addressing.  */
+
+bool
+regno_ok_for_index_p (int regno)
+{
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      if (reg_renumber)
+	regno = reg_renumber[regno];
+      else
+	return true;
+    }
+  return regno == M0_REG || VGPR_REGNO_P (regno);
+}
+
+/* Generate move which uses the exec flags.  If EXEC is NULL, then it is
+   assumed that all lanes normally relevant to the mode of the move are
+   affected.  If PREV is NULL, then a sensible default is supplied for
+   the inactive lanes.  */
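+
+/* For vector modes the emitted RTL is, roughly,
+     (parallel [(set OP0 (vec_merge OP1 PREV EXEC))
+                (clobber (scratch:V64DI))])
+   while scalar moves carry a (use EXEC) alongside the plain set.  */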
+
+static rtx
+gen_mov_with_exec (rtx op0, rtx op1, rtx exec = NULL, rtx prev = NULL)
+{
+  machine_mode mode = GET_MODE (op0);
+
+  if (vgpr_vector_mode_p (mode))
+    {
+      if (exec && exec != CONSTM1_RTX (DImode))
+	{
+	  if (!prev)
+	    prev = op0;
+	}
+      else
+	{
+	  if (!prev)
+	    prev = gcn_gen_undef (mode);
+	  exec = gcn_full_exec_reg ();
+	}
+
+      rtx set = gen_rtx_SET (op0, gen_rtx_VEC_MERGE (mode, op1, prev, exec));
+
+      return gen_rtx_PARALLEL (VOIDmode,
+	       gen_rtvec (2, set,
+			 gen_rtx_CLOBBER (VOIDmode,
+					  gen_rtx_SCRATCH (V64DImode))));
+    }
+
+  return (gen_rtx_PARALLEL
+	  (VOIDmode,
+	   gen_rtvec (2, gen_rtx_SET (op0, op1),
+		      gen_rtx_USE (VOIDmode,
+				   exec ? exec : gcn_scalar_exec ()))));
+}
+
+/* Generate a masked broadcast: lanes of OP0 selected by EXEC receive the
+   scalar OP1 duplicated; the remaining lanes come from OP2.  */
+
+static rtx
+gen_masked_scalar_load (rtx op0, rtx op1, rtx op2, rtx exec)
+{
+  return (gen_rtx_SET (op0,
+		       gen_rtx_VEC_MERGE (GET_MODE (op0),
+					  gen_rtx_VEC_DUPLICATE (GET_MODE
+								 (op0), op1),
+					  op2, exec)));
+}
+
+/* Expand vector init of OP0 by VEC.
+   Implements vec_init instruction pattern.  */
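+
+/* For example (a sketch): given {7, 3, 3, 7, 7, ...}, all 64 lanes are
+   first set to 7 (the value of lane 0), then 3 is written on top with
+   an exec mask selecting only the lanes that should hold 3.  Equal
+   elements are grouped so each distinct value costs one masked write.  */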
+
+void
+gcn_expand_vector_init (rtx op0, rtx vec)
+{
+  int64_t initialized_mask = 0;
+  int64_t curr_mask = 1;
+  machine_mode mode = GET_MODE (op0);
+
+  rtx val = XVECEXP (vec, 0, 0);
+
+  for (int i = 1; i < 64; i++)
+    if (rtx_equal_p (val, XVECEXP (vec, 0, i)))
+      curr_mask |= (int64_t) 1 << i;
+
+  if (gcn_constant_p (val))
+    emit_insn (gen_mov_with_exec (op0, gcn_vec_constant (mode, val)));
+  else
+    {
+      val = force_reg (GET_MODE_INNER (mode), val);
+      emit_insn (gen_masked_scalar_load (op0, val, gcn_gen_undef (mode),
+					 gcn_full_exec_reg ()));
+    }
+  initialized_mask |= curr_mask;
+  for (int i = 1; i < 64; i++)
+    if (!(initialized_mask & ((int64_t) 1 << i)))
+      {
+	curr_mask = (int64_t) 1 << i;
+	rtx val = XVECEXP (vec, 0, i);
+
+	for (int j = i + 1; j < 64; j++)
+	  if (rtx_equal_p (val, XVECEXP (vec, 0, j)))
+	    curr_mask |= (int64_t) 1 << j;
+	if (gcn_constant_p (val))
+	  emit_insn (gen_mov_with_exec (op0, gcn_vec_constant (mode, val),
+					get_exec (curr_mask)));
+	else
+	  {
+	    val = force_reg (GET_MODE_INNER (mode), val);
+	    emit_insn (gen_masked_scalar_load (op0, val, op0,
+					       get_exec (curr_mask)));
+	  }
+	initialized_mask |= curr_mask;
+      }
+}
+
+/* Load vector constant where n-th lane contains BASE+n*VAL.  */
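+
+/* For example (a sketch): strided_constant (V64SImode, 0, 4) yields the
+   lane values 0, 4, 8, ..., 252.  Each masked add below doubles the
+   period of the pattern, so after six steps lane N holds BASE+N*VAL.  */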
+
+static rtx
+strided_constant (machine_mode mode, int base, int val)
+{
+  rtx x = gen_reg_rtx (mode);
+  emit_insn (gen_mov_with_exec (x, gcn_vec_constant (mode, base)));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 32),
+				   get_exec (0xffffffff00000000), x));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 16),
+				   get_exec (0xffff0000ffff0000), x));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 8),
+				   get_exec (0xff00ff00ff00ff00), x));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 4),
+				   get_exec (0xf0f0f0f0f0f0f0f0), x));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 2),
+				   get_exec (0xcccccccccccccccc), x));
+  emit_insn (gen_addv64si3_vector (x, x, gcn_vec_constant (mode, val * 1),
+				   get_exec (0xaaaaaaaaaaaaaaaa), x));
+  return x;
+}
+
+/* Implement TARGET_ADDR_SPACE_LEGITIMIZE_ADDRESS.  */
+
+static rtx
+gcn_addr_space_legitimize_address (rtx x, rtx old, machine_mode mode,
+				   addr_space_t as)
+{
+  switch (as)
+    {
+    case ADDR_SPACE_DEFAULT:
+      return gcn_addr_space_legitimize_address (x, old, mode,
+						DEFAULT_ADDR_SPACE);
+    case ADDR_SPACE_SCALAR_FLAT:
+    case ADDR_SPACE_SCRATCH:
+      /* Instructions working on vectors need the address to be in
+         a register.  */
+      if (vgpr_vector_mode_p (mode))
+	return force_reg (GET_MODE (x), x);
+
+      return x;
+    case ADDR_SPACE_FLAT:
+    case ADDR_SPACE_FLAT_SCRATCH:
+    case ADDR_SPACE_GLOBAL:
+      return TARGET_GCN3 ? force_reg (DImode, x) : x;
+    case ADDR_SPACE_LDS:
+    case ADDR_SPACE_GDS:
+      /* FIXME: LDS supports offsets; handle them.  */
+      if (vgpr_vector_mode_p (mode) && GET_MODE (x) != V64SImode)
+	{
+	  rtx exec = gcn_full_exec_reg ();
+	  rtx addrs = gen_reg_rtx (V64SImode);
+	  rtx base = force_reg (SImode, x);
+	  rtx offsets = strided_constant (V64SImode, 0,
+					  GET_MODE_UNIT_SIZE (mode));
+
+	  emit_insn (gen_vec_duplicatev64si_exec
+		     (addrs, base, exec, gcn_gen_undef (V64SImode)));
+
+	  emit_insn (gen_addv64si3_vector (addrs, offsets, addrs, exec,
+					   gcn_gen_undef (V64SImode)));
+	  return addrs;
+	}
+      return x;
+    }
+  gcc_unreachable ();
+}
+
+/* Convert a (mem:<MODE> (reg:DI)) to (mem:<MODE> (reg:V64DI)) with the
+   proper vector of stepped addresses.
+
+   MEM will be a DImode address of a vector in an SGPR.
+   TMP will be a V64DImode VGPR pair or (scratch:V64DI).  */
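+
+/* For illustration (roughly, for a V64SImode FLAT access): the incoming
+   (mem:V64SI (reg:DI BASE)) is rewritten so that lane L addresses
+   BASE + L * 4, using the 0..63 lane-ID ramp in v1 shifted left by
+   log2 of the element size.  */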
+
+rtx
+gcn_expand_scalar_to_vector_address (machine_mode mode, rtx exec, rtx mem,
+				     rtx tmp)
+{
+  gcc_assert (MEM_P (mem));
+  rtx mem_base = XEXP (mem, 0);
+  rtx mem_index = NULL_RTX;
+
+  if (!TARGET_GCN5_PLUS)
+    {
+      /* gcn_addr_space_legitimize_address should have put the address in a
+         register.  If not, it is too late to do anything about it.  */
+      gcc_assert (REG_P (mem_base));
+    }
+
+  if (GET_CODE (mem_base) == PLUS)
+    {
+      mem_index = XEXP (mem_base, 1);
+      mem_base = XEXP (mem_base, 0);
+    }
+
+  /* RF and RM base registers for vector modes should always be an SGPR.  */
+  gcc_assert (SGPR_REGNO_P (REGNO (mem_base))
+	      || REGNO (mem_base) >= FIRST_PSEUDO_REGISTER);
+
+  machine_mode inner = GET_MODE_INNER (mode);
+  int shift = exact_log2 (GET_MODE_SIZE (inner));
+  rtx ramp = gen_rtx_REG (V64SImode, VGPR_REGNO (1));
+  rtx undef_v64si = gcn_gen_undef (V64SImode);
+  rtx new_base = NULL_RTX;
+  addr_space_t as = MEM_ADDR_SPACE (mem);
+
+  rtx tmplo = (REG_P (tmp)
+	       ? gcn_operand_part (V64DImode, tmp, 0)
+	       : gen_reg_rtx (V64SImode));
+
+  /* tmplo[:] = ramp[:] << shift  */
+  if (exec)
+    emit_insn (gen_ashlv64si3_vector (tmplo, ramp,
+				      gen_int_mode (shift, SImode),
+				      exec, undef_v64si));
+  else
+    emit_insn (gen_ashlv64si3_full (tmplo, ramp,
+				    gen_int_mode (shift, SImode)));
+
+  if (AS_FLAT_P (as))
+    {
+      if (REG_P (tmp))
+	{
+	  rtx vcc = gen_rtx_REG (DImode, CC_SAVE_REG);
+	  rtx mem_base_lo = gcn_operand_part (DImode, mem_base, 0);
+	  rtx mem_base_hi = gcn_operand_part (DImode, mem_base, 1);
+	  rtx tmphi = gcn_operand_part (V64DImode, tmp, 1);
+
+	  /* tmphi[:] = mem_base_hi  */
+	  if (exec)
+	    emit_insn (gen_vec_duplicatev64si_exec (tmphi, mem_base_hi, exec,
+						    undef_v64si));
+	  else
+	    emit_insn (gen_vec_duplicatev64si (tmphi, mem_base_hi));
+
+	  /* tmp[:] += zext (mem_base)  */
+	  if (exec)
+	    {
+	      rtx undef_di = gcn_gen_undef (DImode);
+	      emit_insn (gen_addv64si3_vector_vcc_dup (tmplo, tmplo, mem_base_lo,
+						       exec, undef_v64si, vcc,
+						       undef_di));
+	      emit_insn (gen_addcv64si3_vec (tmphi, tmphi, const0_rtx, exec,
+					     undef_v64si, vcc, vcc,
+					     gcn_vec_constant (V64SImode, 1),
+					     gcn_vec_constant (V64SImode, 0),
+					     undef_di));
+	    }
+	  else
+	    emit_insn (gen_addv64di3_scalarsi (tmp, tmp, mem_base_lo));
+	}
+      else
+	{
+	  tmp = gen_reg_rtx (V64DImode);
+	  emit_insn (gen_addv64di3_zext_dup2 (tmp, tmplo, mem_base, exec,
+					      gcn_gen_undef (V64DImode)));
+	}
+
+      new_base = tmp;
+    }
+  else if (AS_ANY_DS_P (as))
+    {
+      if (!exec)
+	exec = gen_rtx_CONST_INT (VOIDmode, -1);
+
+      emit_insn (gen_addv64si3_vector_dup (tmplo, tmplo, mem_base, exec,
+					   gcn_gen_undef (V64SImode)));
+      new_base = tmplo;
+    }
+  else
+    {
+      mem_base = gen_rtx_VEC_DUPLICATE (V64DImode, mem_base);
+      new_base = gen_rtx_PLUS (V64DImode, mem_base,
+			       gen_rtx_SIGN_EXTEND (V64DImode, tmplo));
+    }
+
+  return gen_rtx_PLUS (GET_MODE (new_base), new_base,
+		       gen_rtx_VEC_DUPLICATE (GET_MODE (new_base),
+					      (mem_index ? mem_index
+					       : const0_rtx)));
+}
+
+/* Return true if a move from OP0 to OP1 is known to be executed in the
+   vector unit.  */
+
+bool
+gcn_vgpr_move_p (rtx op0, rtx op1)
+{
+  if (MEM_P (op0) && AS_SCALAR_FLAT_P (MEM_ADDR_SPACE (op0)))
+    return true;
+  if (MEM_P (op1) && AS_SCALAR_FLAT_P (MEM_ADDR_SPACE (op1)))
+    return true;
+  return ((REG_P (op0) && VGPR_REGNO_P (REGNO (op0)))
+	  || (REG_P (op1) && VGPR_REGNO_P (REGNO (op1)))
+	  || vgpr_vector_mode_p (GET_MODE (op0)));
+}
+
+/* Return true if a move from OP0 to OP1 is known to be executed in the
+   scalar unit.  Used in the machine description.  */
+
+bool
+gcn_sgpr_move_p (rtx op0, rtx op1)
+{
+  if (MEM_P (op0) && AS_SCALAR_FLAT_P (MEM_ADDR_SPACE (op0)))
+    return true;
+  if (MEM_P (op1) && AS_SCALAR_FLAT_P (MEM_ADDR_SPACE (op1)))
+    return true;
+  if (!REG_P (op0) || REGNO (op0) >= FIRST_PSEUDO_REGISTER
+      || VGPR_REGNO_P (REGNO (op0)))
+    return false;
+  if (REG_P (op1)
+      && REGNO (op1) < FIRST_PSEUDO_REGISTER
+      && !VGPR_REGNO_P (REGNO (op1)))
+    return true;
+  return immediate_operand (op1, VOIDmode) || memory_operand (op1, VOIDmode);
+}
+
+/* Implement TARGET_SECONDARY_RELOAD.
+
+   The address space determines which registers can be used for loads and
+   stores.  */
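+
+/* For example (informal, inferred from the cases below): an inbound
+   reload of a V64SImode value from FLAT or GLOBAL memory requests the
+   reload_inv64si pattern, which is expected to supply the scratch
+   needed to form the vector of lane addresses.  */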
+
+static reg_class_t
+gcn_secondary_reload (bool in_p, rtx x, reg_class_t rclass,
+		      machine_mode reload_mode, secondary_reload_info *sri)
+{
+  reg_class_t result = NO_REGS;
+  bool spilled_pseudo =
+    (REG_P (x) || GET_CODE (x) == SUBREG) && true_regnum (x) == -1;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "gcn_secondary_reload: ");
+      dump_value_slim (dump_file, x, 1);
+      fprintf (dump_file, " %s %s:%s", (in_p ? "->" : "<-"),
+	       reg_class_names[rclass], GET_MODE_NAME (reload_mode));
+      if (REG_P (x) || GET_CODE (x) == SUBREG)
+	fprintf (dump_file, " (true regnum: %d \"%s\")", true_regnum (x),
+		 (true_regnum (x) >= 0
+		  && true_regnum (x) < FIRST_PSEUDO_REGISTER
+		  ? reg_names[true_regnum (x)]
+		  : (spilled_pseudo ? "stack spill" : "??")));
+      fprintf (dump_file, "\n");
+    }
+
+  /* Some callers don't use or initialize icode.  */
+  sri->icode = CODE_FOR_nothing;
+
+  if (MEM_P (x) || spilled_pseudo)
+    {
+      addr_space_t as = DEFAULT_ADDR_SPACE;
+
+      /* If we have a spilled pseudo, we can't find the address space
+	 directly, but we know it's in ADDR_SPACE_FLAT space for GCN3 or
+	 ADDR_SPACE_GLOBAL for GCN5.  */
+      if (MEM_P (x))
+	as = MEM_ADDR_SPACE (x);
+
+      if (as == ADDR_SPACE_DEFAULT)
+	as = DEFAULT_ADDR_SPACE;
+
+      switch (as)
+	{
+	case ADDR_SPACE_SCALAR_FLAT:
+	  result =
+	    ((!MEM_P (x) || rclass == SGPR_REGS) ? NO_REGS : SGPR_REGS);
+	  break;
+	case ADDR_SPACE_FLAT:
+	case ADDR_SPACE_FLAT_SCRATCH:
+	case ADDR_SPACE_GLOBAL:
+	  if (GET_MODE_CLASS (reload_mode) == MODE_VECTOR_INT
+	      || GET_MODE_CLASS (reload_mode) == MODE_VECTOR_FLOAT)
+	    {
+	      if (in_p)
+		switch (reload_mode)
+		  {
+		  case E_V64SImode:
+		    sri->icode = CODE_FOR_reload_inv64si;
+		    break;
+		  case E_V64SFmode:
+		    sri->icode = CODE_FOR_reload_inv64sf;
+		    break;
+		  case E_V64HImode:
+		    sri->icode = CODE_FOR_reload_inv64hi;
+		    break;
+		  case E_V64HFmode:
+		    sri->icode = CODE_FOR_reload_inv64hf;
+		    break;
+		  case E_V64QImode:
+		    sri->icode = CODE_FOR_reload_inv64qi;
+		    break;
+		  case E_V64DImode:
+		    sri->icode = CODE_FOR_reload_inv64di;
+		    break;
+		  case E_V64DFmode:
+		    sri->icode = CODE_FOR_reload_inv64df;
+		    break;
+		  default:
+		    gcc_unreachable ();
+		  }
+	      else
+		switch (reload_mode)
+		  {
+		  case E_V64SImode:
+		    sri->icode = CODE_FOR_reload_outv64si;
+		    break;
+		  case E_V64SFmode:
+		    sri->icode = CODE_FOR_reload_outv64sf;
+		    break;
+		  case E_V64HImode:
+		    sri->icode = CODE_FOR_reload_outv64hi;
+		    break;
+		  case E_V64HFmode:
+		    sri->icode = CODE_FOR_reload_outv64hf;
+		    break;
+		  case E_V64QImode:
+		    sri->icode = CODE_FOR_reload_outv64qi;
+		    break;
+		  case E_V64DImode:
+		    sri->icode = CODE_FOR_reload_outv64di;
+		    break;
+		  case E_V64DFmode:
+		    sri->icode = CODE_FOR_reload_outv64df;
+		    break;
+		  default:
+		    gcc_unreachable ();
+		  }
+	      break;
+	    }
+	  /* Fallthrough.  */
+	case ADDR_SPACE_LDS:
+	case ADDR_SPACE_GDS:
+	case ADDR_SPACE_SCRATCH:
+	  result = (rclass == VGPR_REGS ? NO_REGS : VGPR_REGS);
+	  break;
+	}
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "   <= %s (icode: %s)\n", reg_class_names[result],
+	     get_insn_name (sri->icode));
+
+  return result;
+}
+
+/* Update register usage after having seen the compiler flags and kernel
+   attributes.  We typically want to fix registers that contain values
+   set by the HSA runtime.  */
+
+static void
+gcn_conditional_register_usage (void)
+{
+  int i;
+
+  /* FIXME: Do we need to reset fixed_regs?  */
+
+  if (!cfun || !cfun->machine || cfun->machine->normal_function)
+    {
+      /* Normal functions can't know what kernel argument registers are
+         live, so just fix the bottom 16 SGPRs, and bottom 3 VGPRs.  */
+      for (i = 0; i < 16; i++)
+	fixed_regs[FIRST_SGPR_REG + i] = 1;
+      for (i = 0; i < 3; i++)
+	fixed_regs[FIRST_VGPR_REG + i] = 1;
+      return;
+    }
+
+  /* Fix the runtime argument register containing values that may be
+     needed later.  DISPATCH_PTR_ARG and FLAT_SCRATCH_* should not be
+     needed after the prologue so there's no need to fix them.  */
+  if (cfun->machine->args.reg[PRIVATE_SEGMENT_WAVE_OFFSET_ARG] >= 0)
+    fixed_regs[cfun->machine->args.reg[PRIVATE_SEGMENT_WAVE_OFFSET_ARG]] = 1;
+  if (cfun->machine->args.reg[PRIVATE_SEGMENT_BUFFER_ARG] >= 0)
+    {
+      fixed_regs[cfun->machine->args.reg[PRIVATE_SEGMENT_BUFFER_ARG]] = 1;
+      fixed_regs[cfun->machine->args.reg[PRIVATE_SEGMENT_BUFFER_ARG] + 1] = 1;
+      fixed_regs[cfun->machine->args.reg[PRIVATE_SEGMENT_BUFFER_ARG] + 2] = 1;
+      fixed_regs[cfun->machine->args.reg[PRIVATE_SEGMENT_BUFFER_ARG] + 3] = 1;
+    }
+  if (cfun->machine->args.reg[KERNARG_SEGMENT_PTR_ARG] >= 0)
+    {
+      fixed_regs[cfun->machine->args.reg[KERNARG_SEGMENT_PTR_ARG]] = 1;
+      fixed_regs[cfun->machine->args.reg[KERNARG_SEGMENT_PTR_ARG] + 1] = 1;
+    }
+  if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+    {
+      fixed_regs[cfun->machine->args.reg[DISPATCH_PTR_ARG]] = 1;
+      fixed_regs[cfun->machine->args.reg[DISPATCH_PTR_ARG] + 1] = 1;
+    }
+  if (cfun->machine->args.reg[WORKGROUP_ID_X_ARG] >= 0)
+    fixed_regs[cfun->machine->args.reg[WORKGROUP_ID_X_ARG]] = 1;
+  if (cfun->machine->args.reg[WORK_ITEM_ID_X_ARG] >= 0)
+    fixed_regs[cfun->machine->args.reg[WORK_ITEM_ID_X_ARG]] = 1;
+  if (cfun->machine->args.reg[WORK_ITEM_ID_Y_ARG] >= 0)
+    fixed_regs[cfun->machine->args.reg[WORK_ITEM_ID_Y_ARG]] = 1;
+  if (cfun->machine->args.reg[WORK_ITEM_ID_Z_ARG] >= 0)
+    fixed_regs[cfun->machine->args.reg[WORK_ITEM_ID_Z_ARG]] = 1;
+
+  if (TARGET_GCN5_PLUS)
+    /* v0 always holds zero, for use as a null offset by "global"
+       instructions.  */
+    fixed_regs[VGPR_REGNO (0)] = 1;
+}
+
+/* Determine if a load or store is valid, according to the register classes
+   and address space.  Used primarily by the machine description to decide
+   when to split a move into two steps.  */
+
+bool
+gcn_valid_move_p (machine_mode mode, rtx dest, rtx src)
+{
+  if (!MEM_P (dest) && !MEM_P (src))
+    return true;
+
+  if (MEM_P (dest)
+      && AS_FLAT_P (MEM_ADDR_SPACE (dest))
+      && (gcn_flat_address_p (XEXP (dest, 0), mode)
+	  || GET_CODE (XEXP (dest, 0)) == SYMBOL_REF
+	  || GET_CODE (XEXP (dest, 0)) == LABEL_REF)
+      && gcn_vgpr_register_operand (src, mode))
+    return true;
+  else if (MEM_P (src)
+	   && AS_FLAT_P (MEM_ADDR_SPACE (src))
+	   && (gcn_flat_address_p (XEXP (src, 0), mode)
+	       || GET_CODE (XEXP (src, 0)) == SYMBOL_REF
+	       || GET_CODE (XEXP (src, 0)) == LABEL_REF)
+	   && gcn_vgpr_register_operand (dest, mode))
+    return true;
+
+  if (MEM_P (dest)
+      && AS_GLOBAL_P (MEM_ADDR_SPACE (dest))
+      && (gcn_global_address_p (XEXP (dest, 0))
+	  || GET_CODE (XEXP (dest, 0)) == SYMBOL_REF
+	  || GET_CODE (XEXP (dest, 0)) == LABEL_REF)
+      && gcn_vgpr_register_operand (src, mode))
+    return true;
+  else if (MEM_P (src)
+	   && AS_GLOBAL_P (MEM_ADDR_SPACE (src))
+	   && (gcn_global_address_p (XEXP (src, 0))
+	       || GET_CODE (XEXP (src, 0)) == SYMBOL_REF
+	       || GET_CODE (XEXP (src, 0)) == LABEL_REF)
+	   && gcn_vgpr_register_operand (dest, mode))
+    return true;
+
+  if (MEM_P (dest)
+      && MEM_ADDR_SPACE (dest) == ADDR_SPACE_SCALAR_FLAT
+      && (gcn_scalar_flat_address_p (XEXP (dest, 0))
+	  || GET_CODE (XEXP (dest, 0)) == SYMBOL_REF
+	  || GET_CODE (XEXP (dest, 0)) == LABEL_REF)
+      && gcn_ssrc_register_operand (src, mode))
+    return true;
+  else if (MEM_P (src)
+	   && MEM_ADDR_SPACE (src) == ADDR_SPACE_SCALAR_FLAT
+	   && (gcn_scalar_flat_address_p (XEXP (src, 0))
+	       || GET_CODE (XEXP (src, 0)) == SYMBOL_REF
+	       || GET_CODE (XEXP (src, 0)) == LABEL_REF)
+	   && gcn_sdst_register_operand (dest, mode))
+    return true;
+
+  if (MEM_P (dest)
+      && AS_ANY_DS_P (MEM_ADDR_SPACE (dest))
+      && gcn_ds_address_p (XEXP (dest, 0))
+      && gcn_vgpr_register_operand (src, mode))
+    return true;
+  else if (MEM_P (src)
+	   && AS_ANY_DS_P (MEM_ADDR_SPACE (src))
+	   && gcn_ds_address_p (XEXP (src, 0))
+	   && gcn_vgpr_register_operand (dest, mode))
+    return true;
+
+  return false;
+}
+
+/* }}}  */
+/* {{{ Functions and ABI.  */
+
+/* Implement TARGET_FUNCTION_VALUE.
+   
+   Define how to find the value returned by a function.
+   The register location is always the same, but the mode depends on
+   VALTYPE.  */
+
+static rtx
+gcn_function_value (const_tree valtype, const_tree, bool)
+{
+  machine_mode mode = TYPE_MODE (valtype);
+
+  if (INTEGRAL_TYPE_P (valtype)
+      && GET_MODE_CLASS (mode) == MODE_INT
+      && GET_MODE_SIZE (mode) < 4)
+    mode = SImode;
+
+  return gen_rtx_REG (mode, SGPR_REGNO (RETURN_VALUE_REG));
+}
+
+/* Implement TARGET_FUNCTION_VALUE_REGNO_P.
+   
+   Return true if N is a possible register number for the function return
+   value.  */
+
+static bool
+gcn_function_value_regno_p (const unsigned int n)
+{
+  return n == RETURN_VALUE_REG;
+}
+
+/* Calculate the number of registers required to hold a function argument
+   of MODE and TYPE.  */
+
+static int
+num_arg_regs (machine_mode mode, const_tree type)
+{
+  int size;
+
+  if (targetm.calls.must_pass_in_stack (mode, type))
+    return 0;
+
+  if (type && mode == BLKmode)
+    size = int_size_in_bytes (type);
+  else
+    size = GET_MODE_SIZE (mode);
+
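+  /* Round up: e.g. with 4-byte words, a 12-byte struct needs three
+     registers.  */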
+  return (size + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
+}
+
+/* Implement TARGET_STRICT_ARGUMENT_NAMING.
+
+   Return true if the location where a function argument is passed
+   depends on whether or not it is a named argument.
+
+   For gcn, we know how to handle functions declared as stdarg: by
+   passing an extra pointer to the unnamed arguments.  However, the
+   Fortran frontend can produce a different situation, where a
+   function pointer is declared with no arguments, but the actual
+   function and calls to it take more arguments.  In that case, we
+   want to ensure the call matches the definition of the function.  */
+
+static bool
+gcn_strict_argument_naming (cumulative_args_t cum_v)
+{
+  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
+
+  return cum->fntype == NULL_TREE || stdarg_p (cum->fntype);
+}
+
+/* Implement TARGET_PRETEND_OUTGOING_VARARGS_NAMED.
+ 
+   See comment on gcn_strict_argument_naming.  */
+
+static bool
+gcn_pretend_outgoing_varargs_named (cumulative_args_t cum_v)
+{
+  return !gcn_strict_argument_naming (cum_v);
+}
+
+/* Implement TARGET_FUNCTION_ARG.
+ 
+   Return an RTX indicating whether a function argument is passed in a register
+   and if so, which register.  */
+
+static rtx
+gcn_function_arg (cumulative_args_t cum_v, machine_mode mode, const_tree type,
+		  bool named)
+{
+  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
+  if (cum->normal_function)
+    {
+      if (!named || mode == VOIDmode)
+	return 0;
+
+      if (targetm.calls.must_pass_in_stack (mode, type))
+	return 0;
+
+      int reg_num = FIRST_PARM_REG + cum->num;
+      int num_regs = num_arg_regs (mode, type);
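+      /* Align multi-register arguments: e.g. (a sketch) a value
+         occupying two registers starts on an even register number.  */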
+      if (num_regs > 0)
+	while (reg_num % num_regs != 0)
+	  reg_num++;
+      if (reg_num + num_regs <= FIRST_PARM_REG + NUM_PARM_REGS)
+	return gen_rtx_REG (mode, reg_num);
+    }
+  else
+    {
+      if (cum->num >= cum->args.nargs)
+	{
+	  cum->offset = (cum->offset + TYPE_ALIGN (type) / 8 - 1)
+	    & -(TYPE_ALIGN (type) / 8);
+	  cfun->machine->kernarg_segment_alignment
+	    = MAX ((unsigned) cfun->machine->kernarg_segment_alignment,
+		   TYPE_ALIGN (type) / 8);
+	  rtx addr = gen_rtx_REG (DImode,
+				  cum->args.reg[KERNARG_SEGMENT_PTR_ARG]);
+	  if (cum->offset)
+	    addr = gen_rtx_PLUS (DImode, addr,
+				 gen_int_mode (cum->offset, DImode));
+	  rtx mem = gen_rtx_MEM (mode, addr);
+	  set_mem_attributes (mem, const_cast<tree>(type), 1);
+	  set_mem_addr_space (mem, ADDR_SPACE_SCALAR_FLAT);
+	  MEM_READONLY_P (mem) = 1;
+	  return mem;
+	}
+
+      int a = cum->args.order[cum->num];
+      if (mode != gcn_kernel_arg_types[a].mode)
+	{
+	  error ("wrong type of argument %s", gcn_kernel_arg_types[a].name);
+	  return 0;
+	}
+      return gen_rtx_REG ((machine_mode) gcn_kernel_arg_types[a].mode,
+			  cum->args.reg[a]);
+    }
+  return 0;
+}
+
+/* Implement TARGET_FUNCTION_ARG_ADVANCE.
+ 
+   Updates the summarizer variable pointed to by CUM_V to advance past an
+   argument in the argument list.  */
+
+static void
+gcn_function_arg_advance (cumulative_args_t cum_v, machine_mode mode,
+			  const_tree type, bool named)
+{
+  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
+
+  if (cum->normal_function)
+    {
+      if (!named)
+	return;
+
+      int num_regs = num_arg_regs (mode, type);
+      if (num_regs > 0)
+	while ((FIRST_PARM_REG + cum->num) % num_regs != 0)
+	  cum->num++;
+      cum->num += num_regs;
+    }
+  else
+    {
+      if (cum->num < cum->args.nargs)
+	cum->num++;
+      else
+	{
+	  cum->offset += tree_to_uhwi (TYPE_SIZE_UNIT (type));
+	  cfun->machine->kernarg_segment_byte_size = cum->offset;
+	}
+    }
+}
+
+/* Implement TARGET_ARG_PARTIAL_BYTES.
+ 
+   Returns the number of bytes at the beginning of an argument that must be put
+   in registers.  The value must be zero for arguments that are passed entirely
+   in registers or that are entirely pushed on the stack.  */
+
+static int
+gcn_arg_partial_bytes (cumulative_args_t cum_v, machine_mode mode, tree type,
+		       bool named)
+{
+  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
+
+  if (!named)
+    return 0;
+
+  if (targetm.calls.must_pass_in_stack (mode, type))
+    return 0;
+
+  if (cum->num >= NUM_PARM_REGS)
+    return 0;
+
+  /* If the argument fits entirely in registers, return 0.  */
+  if (cum->num + num_arg_regs (mode, type) <= NUM_PARM_REGS)
+    return 0;
+
+  return (NUM_PARM_REGS - cum->num) * UNITS_PER_WORD;
+}
+
+/* A normal function which takes a pointer argument (to a scalar) may be
+   passed a pointer to LDS space (via a high-bits-set aperture), and that only
+   works with FLAT addressing, not GLOBAL.  Force FLAT addressing if the
+   function has an incoming pointer-to-scalar parameter.  */
+
+static void
+gcn_detect_incoming_pointer_arg (tree fndecl)
+{
+  gcc_assert (cfun && cfun->machine);
+
+  for (tree arg = TYPE_ARG_TYPES (TREE_TYPE (fndecl));
+       arg;
+       arg = TREE_CHAIN (arg))
+    if (POINTER_TYPE_P (TREE_VALUE (arg))
+	&& !AGGREGATE_TYPE_P (TREE_TYPE (TREE_VALUE (arg))))
+      cfun->machine->use_flat_addressing = true;
+}
+
+/* Implement INIT_CUMULATIVE_ARGS, via gcn.h.
+   
+   Initialize a variable CUM of type CUMULATIVE_ARGS for a call to a function
+   whose data type is FNTYPE.  For a library call, FNTYPE is 0.  */
+
+void
+gcn_init_cumulative_args (CUMULATIVE_ARGS *cum /* Argument info to init */ ,
+			  tree fntype /* tree ptr for function decl */ ,
+			  rtx libname /* SYMBOL_REF of library name or 0 */ ,
+			  tree fndecl, int caller)
+{
+  memset (cum, 0, sizeof (*cum));
+  cum->fntype = fntype;
+  if (libname)
+    {
+      gcc_assert (cfun && cfun->machine);
+      cum->normal_function = true;
+      if (!caller)
+	{
+	  cfun->machine->normal_function = true;
+	  gcn_detect_incoming_pointer_arg (fndecl);
+	}
+      return;
+    }
+  tree attr = NULL;
+  if (fndecl)
+    attr = lookup_attribute ("amdgpu_hsa_kernel", DECL_ATTRIBUTES (fndecl));
+  if (fndecl && !attr)
+    attr = lookup_attribute ("amdgpu_hsa_kernel",
+			     TYPE_ATTRIBUTES (TREE_TYPE (fndecl)));
+  if (!attr && fntype)
+    attr = lookup_attribute ("amdgpu_hsa_kernel", TYPE_ATTRIBUTES (fntype));
+  /* Handle main () as kernel, so we can run testsuite.
+     Handle OpenACC kernels similarly to main.  */
+  if (!attr && !caller && fndecl
+      && (MAIN_NAME_P (DECL_NAME (fndecl))
+	  || lookup_attribute ("omp target entrypoint",
+			       DECL_ATTRIBUTES (fndecl)) != NULL_TREE))
+    gcn_parse_amdgpu_hsa_kernel_attribute (&cum->args, NULL_TREE);
+  else
+    {
+      if (!attr || caller)
+	{
+	  gcc_assert (cfun && cfun->machine);
+	  cum->normal_function = true;
+	  if (!caller)
+	    cfun->machine->normal_function = true;
+	}
+      gcn_parse_amdgpu_hsa_kernel_attribute
+	(&cum->args, attr ? TREE_VALUE (attr) : NULL_TREE);
+    }
+  cfun->machine->args = cum->args;
+  if (!caller && cfun->machine->normal_function)
+    gcn_detect_incoming_pointer_arg (fndecl);
+}
+
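+/* Implement TARGET_RETURN_IN_MEMORY.
+
+   Return true if TYPE should be returned in memory: aggregates, BLKmode
+   values, and anything larger than two words.  */
+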
+static bool
+gcn_return_in_memory (const_tree type, const_tree ARG_UNUSED (fntype))
+{
+  machine_mode mode = TYPE_MODE (type);
+  HOST_WIDE_INT size = int_size_in_bytes (type);
+
+  if (AGGREGATE_TYPE_P (type))
+    return true;
+
+  if (mode == BLKmode)
+    return true;
+
+  if (size > 2 * UNITS_PER_WORD)
+    return true;
+
+  return false;
+}
+
+/* Implement TARGET_PROMOTE_FUNCTION_MODE.
+ 
+   Return the mode to use for outgoing function arguments.  */
+
+machine_mode
+gcn_promote_function_mode (const_tree ARG_UNUSED (type), machine_mode mode,
+			   int *ARG_UNUSED (punsignedp),
+			   const_tree ARG_UNUSED (funtype),
+			   int ARG_UNUSED (for_return))
+{
+  if (GET_MODE_CLASS (mode) == MODE_INT && GET_MODE_SIZE (mode) < 4)
+    return SImode;
+
+  return mode;
+}
+
+/* Implement TARGET_GIMPLIFY_VA_ARG_EXPR.
+   
+   Derived from hppa_gimplify_va_arg_expr.  The generic routine doesn't handle
+   ARGS_GROW_DOWNWARDS.  */
+
+static tree
+gcn_gimplify_va_arg_expr (tree valist, tree type,
+			  gimple_seq *ARG_UNUSED (pre_p),
+			  gimple_seq *ARG_UNUSED (post_p))
+{
+  tree ptr = build_pointer_type (type);
+  tree valist_type;
+  tree t, u;
+  bool indirect;
+
+  indirect = pass_by_reference (NULL, TYPE_MODE (type), type, 0);
+  if (indirect)
+    {
+      type = ptr;
+      ptr = build_pointer_type (type);
+    }
+  valist_type = TREE_TYPE (valist);
+
+  /* Args grow down.  Not handled by generic routines.  */
+
+  u = fold_convert (sizetype, size_in_bytes (type));
+  u = fold_build1 (NEGATE_EXPR, sizetype, u);
+  t = fold_build_pointer_plus (valist, u);
+
+  /* Align to 8 byte boundary.  */
+
+  u = build_int_cst (TREE_TYPE (t), -8);
+  t = build2 (BIT_AND_EXPR, TREE_TYPE (t), t, u);
+  t = fold_convert (valist_type, t);
+
+  t = build2 (MODIFY_EXPR, valist_type, valist, t);
+
+  t = fold_convert (ptr, t);
+  t = build_va_arg_indirect_ref (t);
+
+  if (indirect)
+    t = build_va_arg_indirect_ref (t);
+
+  return t;
+}
+
+/* Calculate stack offsets needed to create prologues and epilogues.  */
+
+static struct machine_function *
+gcn_compute_frame_offsets (void)
+{
+  machine_function *offsets = cfun->machine;
+
+  if (reload_completed)
+    return offsets;
+
+  offsets->need_frame_pointer = frame_pointer_needed;
+
+  offsets->outgoing_args_size = crtl->outgoing_args_size;
+  offsets->pretend_size = crtl->args.pretend_args_size;
+
+  offsets->local_vars = get_frame_size ();
+
+  offsets->lr_needs_saving = (!leaf_function_p ()
+			      || df_regs_ever_live_p (LR_REGNUM)
+			      || df_regs_ever_live_p (LR_REGNUM + 1));
+
+  offsets->callee_saves = offsets->lr_needs_saving ? 8 : 0;
+
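+  /* Saving a VGPR takes 64 lanes x 4 bytes = 256 bytes; an SGPR takes
+     4 bytes.  */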
+  for (int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if ((df_regs_ever_live_p (regno) && !call_used_regs[regno])
+	|| ((regno & ~1) == HARD_FRAME_POINTER_REGNUM
+	    && frame_pointer_needed))
+      offsets->callee_saves += (VGPR_REGNO_P (regno) ? 256 : 4);
+
+  /* Round up to 64-bit boundary to maintain stack alignment.  */
+  offsets->callee_saves = (offsets->callee_saves + 7) & ~7;
+
+  return offsets;
+}
+
+/* Insert code into the prologue or epilogue to store or load any
+   callee-save register to/from the stack.
+ 
+   Helper function for gcn_expand_prologue and gcn_expand_epilogue.  */
+
+static void
+move_callee_saved_registers (rtx sp, machine_function *offsets,
+			     bool prologue)
+{
+  int regno, offset, saved_scalars;
+  rtx exec = gen_rtx_REG (DImode, EXEC_REG);
+  rtx vcc = gen_rtx_REG (DImode, VCC_LO_REG);
+  rtx offreg = gen_rtx_REG (SImode, SGPR_REGNO (22));
+  rtx as = gen_rtx_CONST_INT (VOIDmode, STACK_ADDR_SPACE);
+  HOST_WIDE_INT exec_set = 0;
+  int offreg_set = 0;
+
+  start_sequence ();
+
+  /* Move scalars into two vector registers.  */
+  for (regno = 0, saved_scalars = 0; regno < FIRST_VGPR_REG; regno++)
+    if ((df_regs_ever_live_p (regno) && !call_used_regs[regno])
+	|| ((regno & ~1) == LINK_REGNUM && offsets->lr_needs_saving)
+	|| ((regno & ~1) == HARD_FRAME_POINTER_REGNUM
+	    && offsets->need_frame_pointer))
+      {
+	rtx reg = gen_rtx_REG (SImode, regno);
+	rtx vreg = gen_rtx_REG (V64SImode,
+				VGPR_REGNO (6 + (saved_scalars / 64)));
+	int lane = saved_scalars % 64;
+
+	if (prologue)
+	  emit_insn (gen_vec_setv64si (vreg, reg, GEN_INT (lane)));
+	else
+	  emit_insn (gen_vec_extractv64sisi (reg, vreg, GEN_INT (lane)));
+
+	saved_scalars++;
+      }
+
+  rtx move_scalars = get_insns ();
+  end_sequence ();
+  start_sequence ();
+
+  /* Ensure that all vector lanes are moved.  */
+  exec_set = -1;
+  emit_move_insn (exec, GEN_INT (exec_set));
+
+  /* Set up a vector stack pointer.  */
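+  /* v1 is assumed to hold the lane-ID ramp 0, 1, 2, ..., 63 here (the
+     same register gcn_expand_scalar_to_vector_address relies on);
+     shifting it left by 2 gives per-lane byte offsets 0, 4, 8, ...  */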
+  rtx _0_1_2_3 = gen_rtx_REG (V64SImode, VGPR_REGNO (1));
+  rtx _0_4_8_12 = gen_rtx_REG (V64SImode, VGPR_REGNO (3));
+  emit_insn (gen_ashlv64si3_vector (_0_4_8_12, _0_1_2_3, GEN_INT (2), exec,
+				    gcn_gen_undef (V64SImode)));
+  rtx vsp = gen_rtx_REG (V64DImode, VGPR_REGNO (4));
+  emit_insn (gen_vec_duplicatev64di_exec (vsp, sp, exec,
+					  gcn_gen_undef (V64DImode)));
+  emit_insn (gen_addv64si3_vector_vcc (gcn_operand_part (V64SImode, vsp, 0),
+				       gcn_operand_part (V64SImode, vsp, 0),
+				       _0_4_8_12, exec,
+				       gcn_gen_undef (V64SImode), vcc,
+				       gcn_gen_undef (DImode)));
+  emit_insn (gen_addcv64si3_vec (gcn_operand_part (V64SImode, vsp, 1),
+				 gcn_operand_part (V64SImode, vsp, 1),
+				 const0_rtx, exec, gcn_gen_undef (V64SImode),
+				 vcc, vcc, gcn_vec_constant (V64SImode, 1),
+				 gcn_vec_constant (V64SImode, 0),
+				 gcn_gen_undef (DImode)));
+
+  /* Move vectors.  */
+  for (regno = FIRST_VGPR_REG, offset = offsets->pretend_size;
+       regno < FIRST_PSEUDO_REGISTER; regno++)
+    if ((df_regs_ever_live_p (regno) && !call_used_regs[regno])
+	|| (regno == VGPR_REGNO (6) && saved_scalars > 0)
+	|| (regno == VGPR_REGNO (7) && saved_scalars > 63))
+      {
+	rtx reg = gen_rtx_REG (V64SImode, regno);
+	int size = 256;
+
+	if (regno == VGPR_REGNO (6) && saved_scalars < 64)
+	  size = saved_scalars * 4;
+	else if (regno == VGPR_REGNO (7) && saved_scalars < 128)
+	  size = (saved_scalars - 64) * 4;
+
+	if (size != 256 || exec_set != -1)
+	  {
+	    exec_set = ((unsigned HOST_WIDE_INT) 1 << (size / 4)) - 1;
+	    emit_move_insn (exec, gen_int_mode (exec_set, DImode));
+	  }
+
+	if (prologue)
+	  emit_insn (gen_scatterv64si_insn_1offset (vsp, const0_rtx, reg, as,
+						    const0_rtx, exec));
+	else
+	  emit_insn (gen_gatherv64si_insn_1offset (reg, vsp, const0_rtx, as,
+						   const0_rtx,
+						   gcn_gen_undef (V64SImode),
+						   exec));
+
+	/* Move our VSP to the next stack entry.  */
+	if (offreg_set != size)
+	  {
+	    offreg_set = size;
+	    emit_move_insn (offreg, GEN_INT (size));
+	  }
+	if (exec_set != -1)
+	  {
+	    exec_set = -1;
+	    emit_move_insn (exec, GEN_INT (exec_set));
+	  }
+	emit_insn (gen_addv64si3_vector_vcc_dup
+		   (gcn_operand_part (V64SImode, vsp, 0),
+		    gcn_operand_part (V64SImode, vsp, 0),
+		    offreg, exec, gcn_gen_undef (V64SImode),
+		    vcc, gcn_gen_undef (DImode)));
+	emit_insn (gen_addcv64si3_vec
+		   (gcn_operand_part (V64SImode, vsp, 1),
+		    gcn_operand_part (V64SImode, vsp, 1),
+		    const0_rtx, exec, gcn_gen_undef (V64SImode),
+		    vcc, vcc, gcn_vec_constant (V64SImode, 1),
+		    gcn_vec_constant (V64SImode, 0), gcn_gen_undef (DImode)));
+
+	offset += size;
+      }
+
+  rtx move_vectors = get_insns ();
+  end_sequence ();
+
+  if (prologue)
+    {
+      emit_insn (move_scalars);
+      emit_insn (move_vectors);
+    }
+  else
+    {
+      emit_insn (move_vectors);
+      emit_insn (move_scalars);
+    }
+}
+
+/* Generate prologue.  Called from gen_prologue during pro_and_epilogue pass.
+
+   For a non-kernel function, the stack layout looks like this (interim),
+   growing *upwards*:
+
+ hi | + ...
+    |__________________| <-- current SP
+    | outgoing args    |
+    |__________________|
+    | (alloca space)   |
+    |__________________|
+    | local vars       |
+    |__________________| <-- FP/hard FP
+    | callee-save regs |
+    |__________________| <-- soft arg pointer
+    | pretend args     |
+    |__________________| <-- incoming SP
+    | incoming args    |
+ lo |..................|
+
+   This implies arguments (beyond the first N in registers) must grow
+   downwards (as, apparently, PA has them do).
+
+   For a kernel function we have the simpler:
+
+ hi | + ...
+    |__________________| <-- current SP
+    | outgoing args    |
+    |__________________|
+    | (alloca space)   |
+    |__________________|
+    | local vars       |
+ lo |__________________| <-- FP/hard FP
+
+*/
+
+void
+gcn_expand_prologue ()
+{
+  machine_function *offsets = gcn_compute_frame_offsets ();
+
+  if (!cfun || !cfun->machine || cfun->machine->normal_function)
+    {
+      rtx sp = gen_rtx_REG (Pmode, STACK_POINTER_REGNUM);
+      rtx fp = gen_rtx_REG (Pmode, HARD_FRAME_POINTER_REGNUM);
+
+      start_sequence ();
+
+      if (offsets->pretend_size > 0)
+	{
+	  /* FIXME: Do the actual saving of register pretend args to the stack.
+	     Register order needs consideration.  */
+	}
+
+      /* Save callee-save regs.  */
+      move_callee_saved_registers (sp, offsets, true);
+
+      HOST_WIDE_INT sp_adjust = offsets->pretend_size
+	+ offsets->callee_saves
+	+ offsets->local_vars + offsets->outgoing_args_size;
+      if (sp_adjust > 0)
+	emit_insn (gen_adddi3 (sp, sp, gen_int_mode (sp_adjust, DImode)));
+
+      if (offsets->need_frame_pointer)
+	emit_insn (gen_adddi3 (fp, sp,
+			       gen_int_mode (-(offsets->local_vars +
+					       offsets->outgoing_args_size),
+					     DImode)));
+
+      rtx_insn *seq = get_insns ();
+      end_sequence ();
+
+      /* FIXME: Prologue insns should have this flag set for debug output, etc.
+	 but it causes issues for now.
+      for (insn = seq; insn; insn = NEXT_INSN (insn))
+        if (INSN_P (insn))
+	  RTX_FRAME_RELATED_P (insn) = 1;*/
+
+      emit_insn (seq);
+    }
+  else
+    {
+      rtx wave_offset = gen_rtx_REG (SImode,
+				     cfun->machine->args.
+				     reg[PRIVATE_SEGMENT_WAVE_OFFSET_ARG]);
+
+      if (TARGET_GCN5_PLUS)
+	{
+	  /* v0 is reserved for constant zero so that "global"
+	     memory instructions can have a null offset without
+	     causing reloads.  */
+	  rtx exec = gen_rtx_REG (DImode, EXEC_REG);
+	  emit_move_insn (exec, GEN_INT (-1));
+	  emit_insn (gen_vec_duplicatev64si_exec
+		     (gen_rtx_REG (V64SImode, VGPR_REGNO (0)),
+		      const0_rtx, exec, gcn_gen_undef (V64SImode)));
+	}
+
+      if (cfun->machine->args.requested & (1 << FLAT_SCRATCH_INIT_ARG))
+	{
+	  rtx fs_init_lo =
+	    gen_rtx_REG (SImode,
+			 cfun->machine->args.reg[FLAT_SCRATCH_INIT_ARG]);
+	  rtx fs_init_hi =
+	    gen_rtx_REG (SImode,
+			 cfun->machine->args.reg[FLAT_SCRATCH_INIT_ARG] + 1);
+	  rtx fs_reg_lo = gen_rtx_REG (SImode, FLAT_SCRATCH_REG);
+	  rtx fs_reg_hi = gen_rtx_REG (SImode, FLAT_SCRATCH_REG + 1);
+
+	  /*rtx queue = gen_rtx_REG(DImode,
+				  cfun->machine->args.reg[QUEUE_PTR_ARG]);
+	  rtx aperture = gen_rtx_MEM (SImode,
+				      gen_rtx_PLUS (DImode, queue,
+						    gen_int_mode (68, SImode)));
+	  set_mem_addr_space (aperture, ADDR_SPACE_SCALAR_FLAT);*/
+
+	  /* Set up flat_scratch.  */
+	  emit_insn (gen_addsi3 (fs_reg_hi, fs_init_lo, wave_offset));
+	  emit_insn (gen_lshrsi3_scalar (fs_reg_hi, fs_reg_hi,
+					 gen_int_mode (8, SImode)));
+	  emit_move_insn (fs_reg_lo, fs_init_hi);
+	}
+
+      /* Set up frame pointer and stack pointer.  */
+      rtx sp = gen_rtx_REG (DImode, STACK_POINTER_REGNUM);
+      rtx fp = gen_rtx_REG (DImode, HARD_FRAME_POINTER_REGNUM);
+      rtx fp_hi = simplify_gen_subreg (SImode, fp, DImode, 4);
+      rtx fp_lo = simplify_gen_subreg (SImode, fp, DImode, 0);
+
+      HOST_WIDE_INT sp_adjust = (offsets->local_vars
+				 + offsets->outgoing_args_size);
+
+      /* Initialise FP and SP from the buffer descriptor in s[0:3].  */
+      emit_move_insn (fp_lo, gen_rtx_REG (SImode, 0));
+      emit_insn (gen_andsi3 (fp_hi, gen_rtx_REG (SImode, 1),
+			     gen_int_mode (0xffff, SImode)));
+      emit_insn (gen_addsi3 (fp_lo, fp_lo, wave_offset));
+      emit_insn (gen_addcsi3_scalar_zero (fp_hi, fp_hi,
+					  gen_rtx_REG (BImode, SCC_REG)));
+
+      if (sp_adjust > 0)
+	emit_insn (gen_adddi3 (sp, fp, gen_int_mode (sp_adjust, DImode)));
+      else
+	emit_move_insn (sp, fp);
+
+      /* Make sure the flat scratch reg doesn't get optimised away.  */
+      emit_insn (gen_prologue_use (gen_rtx_REG (DImode, FLAT_SCRATCH_REG)));
+    }
+
+  emit_move_insn (gen_rtx_REG (SImode, M0_REG),
+		  gen_int_mode (LDS_SIZE, SImode));
+
+  emit_insn (gen_prologue_use (gen_rtx_REG (SImode, M0_REG)));
+  if (TARGET_GCN5_PLUS)
+    emit_insn (gen_prologue_use (gen_rtx_REG (SImode, VGPR_REGNO (0))));
+
+  if (cfun && cfun->machine && !cfun->machine->normal_function && flag_openmp)
+    {
+      /* OpenMP kernels have an implicit call to gomp_gcn_enter_kernel.  */
+      rtx fn_reg = gen_rtx_REG (Pmode, FIRST_PARM_REG);
+      emit_move_insn (fn_reg, gen_rtx_SYMBOL_REF (Pmode,
+						  "gomp_gcn_enter_kernel"));
+      emit_call_insn (gen_gcn_indirect_call (fn_reg, const0_rtx));
+    }
+}
+
+/* Generate epilogue.  Called from gen_epilogue during pro_and_epilogue pass.
+
+   See gcn_expand_prologue for stack details.  */
+
+void
+gcn_expand_epilogue (void)
+{
+  if (!cfun || !cfun->machine || cfun->machine->normal_function)
+    {
+      machine_function *offsets = gcn_compute_frame_offsets ();
+      rtx sp = gen_rtx_REG (Pmode, STACK_POINTER_REGNUM);
+      rtx fp = gen_rtx_REG (Pmode, HARD_FRAME_POINTER_REGNUM);
+
+      HOST_WIDE_INT sp_adjust = offsets->callee_saves + offsets->pretend_size;
+
+      if (offsets->need_frame_pointer)
+	{
+	  /* Restore old SP from the frame pointer.  */
+	  if (sp_adjust > 0)
+	    emit_insn (gen_subdi3 (sp, fp, gen_int_mode (sp_adjust, DImode)));
+	  else
+	    emit_move_insn (sp, fp);
+	}
+      else
+	{
+	  /* Restore old SP from current SP.  */
+	  sp_adjust += offsets->outgoing_args_size + offsets->local_vars;
+
+	  if (sp_adjust > 0)
+	    emit_insn (gen_subdi3 (sp, sp, gen_int_mode (sp_adjust, DImode)));
+	}
+
+      move_callee_saved_registers (sp, offsets, false);
+
+      /* There's no explicit use of the link register on the return insn.  Emit
+         one here instead.  */
+      if (offsets->lr_needs_saving)
+	emit_use (gen_rtx_REG (DImode, LINK_REGNUM));
+
+      /* Similar for frame pointer.  */
+      if (offsets->need_frame_pointer)
+	emit_use (gen_rtx_REG (DImode, HARD_FRAME_POINTER_REGNUM));
+    }
+  else if (flag_openmp)
+    {
+      /* OpenMP kernels have an implicit call to gomp_gcn_exit_kernel.  */
+      rtx fn_reg = gen_rtx_REG (Pmode, FIRST_PARM_REG);
+      emit_move_insn (fn_reg,
+		      gen_rtx_SYMBOL_REF (Pmode, "gomp_gcn_exit_kernel"));
+      emit_call_insn (gen_gcn_indirect_call (fn_reg, const0_rtx));
+    }
+  else if (TREE_CODE (TREE_TYPE (DECL_RESULT (cfun->decl))) != VOID_TYPE)
+    {
+      /* Assume that an exit value compatible with gcn-run is expected.
+         That is, the third input parameter is an int*.
+
+         We can't allocate any new registers, but the kernarg_reg is
+         dead after this, so we'll use that.  */
+      rtx kernarg_reg = gen_rtx_REG (DImode, cfun->machine->args.reg
+				     [KERNARG_SEGMENT_PTR_ARG]);
+      rtx retptr_mem = gen_rtx_MEM (DImode,
+				    gen_rtx_PLUS (DImode, kernarg_reg,
+						  GEN_INT (16)));
+      set_mem_addr_space (retptr_mem, ADDR_SPACE_SCALAR_FLAT);
+      emit_move_insn (kernarg_reg, retptr_mem);
+
+      rtx retval_mem = gen_rtx_MEM (SImode, kernarg_reg);
+      set_mem_addr_space (retval_mem, ADDR_SPACE_SCALAR_FLAT);
+      emit_move_insn (retval_mem,
+		      gen_rtx_REG (SImode, SGPR_REGNO (RETURN_VALUE_REG)));
+    }
+
+  emit_jump_insn (gen_gcn_return ());
+}
+
+/* Implement TARGET_CAN_ELIMINATE.
+ 
+   Return true if the compiler is allowed to try to replace register number
+   FROM_REG with register number TO_REG.
+ 
+   FIXME: is the default "true" not enough? Should this be a negative set?  */
+
+bool
+gcn_can_eliminate_p (int /*from_reg */ , int to_reg)
+{
+  return (to_reg == HARD_FRAME_POINTER_REGNUM
+	  || to_reg == STACK_POINTER_REGNUM);
+}
+
+/* Implement INITIAL_ELIMINATION_OFFSET.
+ 
+   Returns the initial difference between the specified pair of registers, in
+   terms of stack position.  */
+
+HOST_WIDE_INT
+gcn_initial_elimination_offset (int from, int to)
+{
+  machine_function *offsets = gcn_compute_frame_offsets ();
+
+  switch (from)
+    {
+    case ARG_POINTER_REGNUM:
+      if (to == STACK_POINTER_REGNUM)
+	return -(offsets->callee_saves + offsets->local_vars
+		 + offsets->outgoing_args_size);
+      else if (to == FRAME_POINTER_REGNUM || to == HARD_FRAME_POINTER_REGNUM)
+	return -offsets->callee_saves;
+      else
+	gcc_unreachable ();
+      break;
+
+    case FRAME_POINTER_REGNUM:
+      if (to == STACK_POINTER_REGNUM)
+	return -(offsets->local_vars + offsets->outgoing_args_size);
+      else if (to == HARD_FRAME_POINTER_REGNUM)
+	return 0;
+      else
+	gcc_unreachable ();
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Implement HARD_REGNO_RENAME_OK.
+
+   Return true if it is permissible to rename a hard register from
+   FROM_REG to TO_REG.  */
+
+bool
+gcn_hard_regno_rename_ok (unsigned int from_reg, unsigned int to_reg)
+{
+  if (SPECIAL_REGNO_P (from_reg) || SPECIAL_REGNO_P (to_reg))
+    return false;
+
+  /* Allow the link register to be used if it was saved.  */
+  if ((to_reg & ~1) == LINK_REGNUM)
+    return !cfun || cfun->machine->lr_needs_saving;
+
+  /* Allow the registers used for the static chain to be used if the chain is
+     not in active use.  */
+  if ((to_reg & ~1) == STATIC_CHAIN_REGNUM)
+    return !cfun
+	|| !(cfun->static_chain_decl
+	     && df_regs_ever_live_p (STATIC_CHAIN_REGNUM)
+	     && df_regs_ever_live_p (STATIC_CHAIN_REGNUM + 1));
+
+  return true;
+}
+
+/* Implement HARD_REGNO_CALLER_SAVE_MODE.
+ 
+   Which mode is required for saving NREGS of a pseudo-register in
+   call-clobbered hard register REGNO.  */
+
+machine_mode
+gcn_hard_regno_caller_save_mode (unsigned int regno, unsigned int nregs,
+				 machine_mode regmode)
+{
+  machine_mode result = choose_hard_reg_mode (regno, nregs, false);
+
+  if (VECTOR_MODE_P (result) && !VECTOR_MODE_P (regmode))
+    result = (nregs == 1 ? SImode : DImode);
+
+  return result;
+}
+
+/* Implement TARGET_ASM_TRAMPOLINE_TEMPLATE.
+
+   Output assembler code for a block containing the constant parts
+   of a trampoline, leaving space for the variable parts.  */
+
+static void
+gcn_asm_trampoline_template (FILE *f)
+{
+  /* Each move instruction takes its 32-bit source operand from the
+     word immediately following the opcode; the 0xffff values are
+     placeholders that gcn_trampoline_init overwrites with the real
+     chain and target address values.  */
+  asm_fprintf (f, "\ts_mov_b32\ts%i, 0xffff\n", STATIC_CHAIN_REGNUM);
+  asm_fprintf (f, "\ts_mov_b32\ts%i, 0xffff\n", STATIC_CHAIN_REGNUM + 1);
+  asm_fprintf (f, "\ts_mov_b32\ts%i, 0xffff\n", CC_SAVE_REG);
+  asm_fprintf (f, "\ts_mov_b32\ts%i, 0xffff\n", CC_SAVE_REG + 1);
+  asm_fprintf (f, "\ts_setpc_b64\ts[%i:%i]\n", CC_SAVE_REG, CC_SAVE_REG + 1);
+}
+
+/* Implement TARGET_TRAMPOLINE_INIT.
+
+   Emit RTL insns to initialize the variable parts of a trampoline.
+   FNDECL is the decl of the target address, M_TRAMP is a MEM for
+   the trampoline, and CHAIN_VALUE is an RTX for the static chain
+   to be passed to the target function.  */
+
+static void
+gcn_trampoline_init (rtx m_tramp, tree fndecl, rtx chain_value)
+{
+  emit_block_move (m_tramp, assemble_trampoline_template (),
+		   GEN_INT (TRAMPOLINE_SIZE), BLOCK_OP_NORMAL);
+
+  rtx fnaddr = XEXP (DECL_RTL (fndecl), 0);
+  rtx chain_value_reg = copy_to_reg (chain_value);
+  rtx fnaddr_reg = copy_to_reg (fnaddr);
+
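+  /* Patch the four 32-bit literals in the template: words 0-1 receive
+     the static chain, words 2-3 the target address.  Each insn is
+     8 bytes long, with its literal at byte offset 4.  */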
+  for (int i = 0; i < 4; i++)
+    {
+      rtx mem = adjust_address (m_tramp, SImode, i * 8 + 4);
+      rtx reg = i < 2 ? chain_value_reg : fnaddr_reg;
+      emit_move_insn (mem, gen_rtx_SUBREG (SImode, reg, (i % 2) * 4));
+    }
+
+  rtx tramp_addr = XEXP (m_tramp, 0);
+  emit_insn (gen_clear_icache (tramp_addr,
+			       plus_constant (ptr_mode, tramp_addr,
+					      TRAMPOLINE_SIZE)));
+}
+
+/* }}}  */
+/* {{{ Miscellaneous.  */
+
+/* Implement TARGET_CANNOT_COPY_INSN_P.
+ 
+   Return true if INSN must not be duplicated.  */
+
+static bool
+gcn_cannot_copy_insn_p (rtx_insn *insn)
+{
+  if (recog_memoized (insn) == CODE_FOR_gcn_wavefront_barrier)
+    return true;
+
+  return false;
+}
+
+/* Implement TARGET_DEBUG_UNWIND_INFO.
+
+   Defines the mechanism that will be used for describing frame unwind
+   information to the debugger.  */
+
+static enum unwind_info_type
+gcn_debug_unwind_info ()
+{
+  /* No support for debug info, yet.  */
+  return UI_NONE;
+}
+
+/* Determine if there is a suitable hardware conversion instruction.
+   Used primarily by the machine description.  */
+
+bool
+gcn_valid_cvt_p (machine_mode from, machine_mode to, enum gcn_cvt_t op)
+{
+  if (VECTOR_MODE_P (from) != VECTOR_MODE_P (to))
+    return false;
+
+  if (VECTOR_MODE_P (from))
+    {
+      from = GET_MODE_INNER (from);
+      to = GET_MODE_INNER (to);
+    }
+
+  switch (op)
+    {
+    case fix_trunc_cvt:
+    case fixuns_trunc_cvt:
+      if (GET_MODE_CLASS (from) != MODE_FLOAT
+	  || GET_MODE_CLASS (to) != MODE_INT)
+	return false;
+      break;
+    case float_cvt:
+    case floatuns_cvt:
+      if (GET_MODE_CLASS (from) != MODE_INT
+	  || GET_MODE_CLASS (to) != MODE_FLOAT)
+	return false;
+      break;
+    case extend_cvt:
+      if (GET_MODE_CLASS (from) != MODE_FLOAT
+	  || GET_MODE_CLASS (to) != MODE_FLOAT
+	  || GET_MODE_SIZE (from) >= GET_MODE_SIZE (to))
+	return false;
+      break;
+    case trunc_cvt:
+      if (GET_MODE_CLASS (from) != MODE_FLOAT
+	  || GET_MODE_CLASS (to) != MODE_FLOAT
+	  || GET_MODE_SIZE (from) <= GET_MODE_SIZE (to))
+	return false;
+      break;
+    }
+
+  return ((to == HImode && from == HFmode)
+	  || (to == SImode && (from == SFmode || from == DFmode))
+	  || (to == HFmode && (from == HImode || from == SFmode))
+	  || (to == SFmode && (from == SImode || from == HFmode
+			       || from == DFmode))
+	  || (to == DFmode && (from == SImode || from == SFmode)));
+}
+
+/* Implement TARGET_LEGITIMATE_COMBINED_INSN.
+
+   Return false if the instruction is not appropriate as a combination of two
+   or more instructions.  */
+
+bool
+gcn_legitimate_combined_insn (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+
+  /* The combine pass tends to strip (use (exec)) patterns from insns.  This
+     means it basically switches everything to use the *_scalar form of the
+     instructions, which is not helpful.  So, this function disallows such
+     combinations.  Unfortunately, this also disallows combinations of genuine
+     scalar-only patterns, but those only come from explicit expand code.
+
+     Possible solutions:
+     - Invent TARGET_LEGITIMIZE_COMBINED_INSN.
+     - Remove all (use (EXEC)) and rely on md_reorg with "exec" attribute.
+   */
+
+  switch (GET_CODE (pat))
+    {
+    case SET:
+      /* Vector mode patterns are fine.  */
+      if (VECTOR_MODE_P (GET_MODE (XEXP (pat, 0))))
+	return true;
+      /* Plain moves are fine (fixed up by md_reorg).  */
+      switch (GET_CODE (XEXP (pat, 1)))
+	{
+	case REG:
+	case SUBREG:
+	case MEM:
+	  return true;
+	default:
+	  /* Any other scalar operation should have been a parallel.  */
+	  return false;
+	}
+    case PARALLEL:
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+	{
+	  rtx subpat = XVECEXP (pat, 0, i);
+	  switch (GET_CODE (subpat))
+	    {
+	    case USE:
+	      /* FIXME: check it really is EXEC that is used.
+	         Does combine ever generate a pattern with a use?  */
+	      return true;
+	    case SET:
+	      /* Vector mode patterns are fine.  */
+	      if (VECTOR_MODE_P (GET_MODE (XEXP (subpat, 0))))
+		return true;
+	    default:
+	      break;
+	    }
+	}
+      /* A suitable pattern was not found.  */
+      return false;
+    default:
+      return true;
+    }
+}
+
+/* Implement both TARGET_ASM_CONSTRUCTOR and TARGET_ASM_DESTRUCTOR.
+
+   The current loader does not support running code outside "main".  This
+   hook implementation can be replaced or removed when that changes.  */
+
+void
+gcn_disable_constructors (rtx symbol, int priority __attribute__ ((unused)))
+{
+  tree d = SYMBOL_REF_DECL (symbol);
+  location_t l = d ? DECL_SOURCE_LOCATION (d) : UNKNOWN_LOCATION;
+
+  sorry_at (l, "GCN does not support static constructors or destructors");
+}
+
+/* }}}  */
+/* {{{ Costs.  */
+
+/* Implement TARGET_RTX_COSTS.
+   
+   Compute a (partial) cost for rtx X.  Return true if the complete
+   cost has been computed, and false if subexpressions should be
+   scanned.  In either case, *TOTAL contains the cost result.  */
+
+static bool
+gcn_rtx_costs (rtx x, machine_mode, int, int, int *total, bool)
+{
+  enum rtx_code code = GET_CODE (x);
+  switch (code)
+    {
+    case CONST:
+    case CONST_DOUBLE:
+    case CONST_VECTOR:
+    case CONST_INT:
+      if (gcn_inline_constant_p (x))
+	*total = 0;
+      else if (code == CONST_INT
+	       && ((unsigned HOST_WIDE_INT) INTVAL (x) + 0x8000) < 0x10000)
+	*total = 1;
+      else if (gcn_constant_p (x))
+	*total = 2;
+      else
+	*total = vgpr_vector_mode_p (GET_MODE (x)) ? 64 : 4;
+      return true;
+
+    case DIV:
+      *total = 100;
+      return false;
+
+    default:
+      *total = 3;
+      return false;
+    }
+}
+
+/* Implement TARGET_MEMORY_MOVE_COST.
+
+   Return the cost of moving data of mode MODE between a register of class
+   REGCLASS and memory; IN is true for loads and false for stores.  A value
+   of 2 is the default; this cost is relative to those in
+   `REGISTER_MOVE_COST'.  */
+
+#define LOAD_COST  32
+#define STORE_COST 32
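+/* Illustrative example: a V64SImode value occupies 64 4-byte registers
+   (nregs == 64 below), so loading it from memory into VGPRs costs
+   (LOAD_COST + 2) * 64 in these REGISTER_MOVE_COST-relative units.  */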
+static int
+gcn_memory_move_cost (machine_mode mode, reg_class_t regclass, bool in)
+{
+  int nregs = CEIL (GET_MODE_SIZE (mode), 4);
+  switch (regclass)
+    {
+    case SCC_CONDITIONAL_REG:
+    case VCCZ_CONDITIONAL_REG:
+    case VCC_CONDITIONAL_REG:
+    case EXECZ_CONDITIONAL_REG:
+    case ALL_CONDITIONAL_REGS:
+    case SGPR_REGS:
+    case SGPR_EXEC_REGS:
+    case EXEC_MASK_REG:
+    case SGPR_VOP3A_SRC_REGS:
+    case SGPR_MEM_SRC_REGS:
+    case SGPR_SRC_REGS:
+    case SGPR_DST_REGS:
+    case GENERAL_REGS:
+    case AFP_REGS:
+      if (!in)
+	return (STORE_COST + 2) * nregs;
+      return LOAD_COST * nregs;
+    case VGPR_REGS:
+      if (in)
+	return (LOAD_COST + 2) * nregs;
+      return STORE_COST * nregs;
+    case ALL_REGS:
+    case ALL_GPR_REGS:
+    case SRCDST_REGS:
+      if (in)
+	return (LOAD_COST + 2) * nregs;
+      return (STORE_COST + 2) * nregs;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Implement TARGET_REGISTER_MOVE_COST.
+
+   Return the cost of moving data from a register in class CLASS1 to
+   one in class CLASS2.  Base value is 2.  */
+
+static int
+gcn_register_move_cost (machine_mode, reg_class_t dst, reg_class_t src)
+{
+  /* Increase cost of moving from and to vector registers.  While this is
+     fast in hardware (I think), it has hidden cost of setting up the exec
+     flags.  */
+  if ((src < VGPR_REGS) != (dst < VGPR_REGS))
+    return 4;
+  return 2;
+}
+
+/* }}}  */
+/* {{{ Builtins.  */
+
+/* Type codes used by GCN built-in definitions.  */
+
+enum gcn_builtin_type_index
+{
+  GCN_BTI_END_OF_PARAMS,
+
+  GCN_BTI_VOID,
+  GCN_BTI_BOOL,
+  GCN_BTI_INT,
+  GCN_BTI_UINT,
+  GCN_BTI_SIZE_T,
+  GCN_BTI_LLINT,
+  GCN_BTI_LLUINT,
+  GCN_BTI_EXEC,
+
+  GCN_BTI_SF,
+  GCN_BTI_V64SI,
+  GCN_BTI_V64SF,
+  GCN_BTI_V64PTR,
+  GCN_BTI_SIPTR,
+  GCN_BTI_SFPTR,
+  GCN_BTI_VOIDPTR,
+
+  GCN_BTI_LDS_VOIDPTR,
+
+  GCN_BTI_MAX
+};
+
+static GTY(()) tree gcn_builtin_types[GCN_BTI_MAX];
+
+#define exec_type_node (gcn_builtin_types[GCN_BTI_EXEC])
+#define sf_type_node (gcn_builtin_types[GCN_BTI_SF])
+#define v64si_type_node (gcn_builtin_types[GCN_BTI_V64SI])
+#define v64sf_type_node (gcn_builtin_types[GCN_BTI_V64SF])
+#define v64ptr_type_node (gcn_builtin_types[GCN_BTI_V64PTR])
+#define siptr_type_node (gcn_builtin_types[GCN_BTI_SIPTR])
+#define sfptr_type_node (gcn_builtin_types[GCN_BTI_SFPTR])
+#define voidptr_type_node (gcn_builtin_types[GCN_BTI_VOIDPTR])
+#define size_t_type_node (gcn_builtin_types[GCN_BTI_SIZE_T])
+
+struct gcn_builtin_description;
+typedef rtx (*gcn_builtin_expander) (tree, rtx, rtx, machine_mode, int,
+				     struct gcn_builtin_description *);
+
+static rtx gcn_expand_builtin_1 (tree, rtx, rtx, machine_mode, int,
+				 struct gcn_builtin_description *);
+static rtx gcn_expand_builtin_binop (tree, rtx, rtx, machine_mode, int,
+				     struct gcn_builtin_description *);
+
+enum gcn_builtin_type
+{
+  B_UNIMPLEMENTED,		/* Emit a "sorry" diagnostic */
+  B_INSN,			/* Emit a pattern */
+  B_OVERLOAD			/* Placeholder for an overloaded function */
+};
+
+struct gcn_builtin_description
+{
+  int fcode;
+  int icode;
+  const char *name;
+  enum gcn_builtin_type type;
+  /* The first element of parm is always the return type.  The rest
+     are a zero-terminated list of parameters.  */
+  int parm[6];
+  gcn_builtin_expander expander;
+};
+
+/* Read in the GCN builtins from gcn-builtins.def.  */
+
+extern GTY(()) struct gcn_builtin_description gcn_builtins[GCN_BUILTIN_MAX];
+
+struct gcn_builtin_description gcn_builtins[] = {
+#define DEF_BUILTIN(fcode, icode, name, type, params, expander)	\
+  {GCN_BUILTIN_ ## fcode, icode, name, type, params, expander},
+
+#define DEF_BUILTIN_BINOP_INT_FP(fcode, ic, name)			\
+  {GCN_BUILTIN_ ## fcode ## _V64SI,					\
+   CODE_FOR_ ## ic ##v64si3_vector, name "_v64int", B_INSN,		\
+   {GCN_BTI_V64SI, GCN_BTI_EXEC, GCN_BTI_V64SI, GCN_BTI_V64SI,		\
+    GCN_BTI_V64SI, GCN_BTI_END_OF_PARAMS}, gcn_expand_builtin_binop},	\
+  {GCN_BUILTIN_ ## fcode ## _V64SI_unspec,				\
+   CODE_FOR_ ## ic ##v64si3_vector, name "_v64int_unspec", B_INSN, 	\
+   {GCN_BTI_V64SI, GCN_BTI_EXEC, GCN_BTI_V64SI, GCN_BTI_V64SI,		\
+    GCN_BTI_END_OF_PARAMS}, gcn_expand_builtin_binop},
+
+#include "gcn-builtins.def"
+#undef DEF_BUILTIN_BINOP_INT_FP
+#undef DEF_BUILTIN
+};
+
+static GTY(()) tree gcn_builtin_decls[GCN_BUILTIN_MAX];
+
+/* Implement TARGET_BUILTIN_DECL.
+
+   Return the GCN builtin for CODE.  */
+
+tree
+gcn_builtin_decl (unsigned code, bool ARG_UNUSED (initialize_p))
+{
+  if (code >= GCN_BUILTIN_MAX)
+    return error_mark_node;
+
+  return gcn_builtin_decls[code];
+}
+
+/* Helper function for gcn_init_builtins.  */
+
+static void
+gcn_init_builtin_types (void)
+{
+  gcn_builtin_types[GCN_BTI_VOID] = void_type_node;
+  gcn_builtin_types[GCN_BTI_BOOL] = boolean_type_node;
+  gcn_builtin_types[GCN_BTI_INT] = intSI_type_node;
+  gcn_builtin_types[GCN_BTI_UINT] = unsigned_type_for (intSI_type_node);
+  gcn_builtin_types[GCN_BTI_SIZE_T] = size_type_node;
+  gcn_builtin_types[GCN_BTI_LLINT] = intDI_type_node;
+  gcn_builtin_types[GCN_BTI_LLUINT] = unsigned_type_for (intDI_type_node);
+
+  exec_type_node = unsigned_intDI_type_node;
+  sf_type_node = float32_type_node;
+  v64si_type_node = build_vector_type (intSI_type_node, 64);
+  v64sf_type_node = build_vector_type (float_type_node, 64);
+  v64ptr_type_node = build_vector_type (unsigned_intDI_type_node
+					/*build_pointer_type
+					  (integer_type_node) */
+					, 64);
+  tree tmp = build_distinct_type_copy (intSI_type_node);
+  TYPE_ADDR_SPACE (tmp) = ADDR_SPACE_FLAT;
+  siptr_type_node = build_pointer_type (tmp);
+
+  tmp = build_distinct_type_copy (float_type_node);
+  TYPE_ADDR_SPACE (tmp) = ADDR_SPACE_FLAT;
+  sfptr_type_node = build_pointer_type (tmp);
+
+  tmp = build_distinct_type_copy (void_type_node);
+  TYPE_ADDR_SPACE (tmp) = ADDR_SPACE_FLAT;
+  voidptr_type_node = build_pointer_type (tmp);
+
+  tmp = build_distinct_type_copy (void_type_node);
+  TYPE_ADDR_SPACE (tmp) = ADDR_SPACE_LDS;
+  gcn_builtin_types[GCN_BTI_LDS_VOIDPTR] = build_pointer_type (tmp);
+}
+
+/* Implement TARGET_INIT_BUILTINS.
+
+   Set up all builtin functions for this target.  */
+
+static void
+gcn_init_builtins (void)
+{
+  gcn_init_builtin_types ();
+
+  struct gcn_builtin_description *d;
+  unsigned int i;
+  for (i = 0, d = gcn_builtins; i < GCN_BUILTIN_MAX; i++, d++)
+    {
+      tree p;
+      char name[64];	/* add_builtin_function will make a copy.  */
+      int parm;
+
+      /* FIXME: Is this necessary/useful? */
+      if (d->name == 0)
+	continue;
+
+      /* Find last parm.  */
+      for (parm = 1; d->parm[parm] != GCN_BTI_END_OF_PARAMS; parm++)
+	;
+
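+      /* Build the argument list back to front: tree_cons prepends, so
+	 walking PARM downwards yields the arguments in declaration order,
+	 terminated by void_list_node.  */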
+      p = void_list_node;
+      while (parm > 1)
+	p = tree_cons (NULL_TREE, gcn_builtin_types[d->parm[--parm]], p);
+
+      p = build_function_type (gcn_builtin_types[d->parm[0]], p);
+
+      sprintf (name, "__builtin_gcn_%s", d->name);
+      gcn_builtin_decls[i]
+	= add_builtin_function (name, p, i, BUILT_IN_MD, NULL, NULL_TREE);
+
+      /* These builtins don't throw.  */
+      TREE_NOTHROW (gcn_builtin_decls[i]) = 1;
+    }
+
+/* FIXME: remove the ifdef once OpenACC support is merged upstream.  */
+#ifdef BUILT_IN_GOACC_SINGLE_START
+  /* These builtins need to take/return an LDS pointer: override the generic
+     versions here.  */
+
+  set_builtin_decl (BUILT_IN_GOACC_SINGLE_START,
+		    gcn_builtin_decls[GCN_BUILTIN_ACC_SINGLE_START], false);
+
+  set_builtin_decl (BUILT_IN_GOACC_SINGLE_COPY_START,
+		    gcn_builtin_decls[GCN_BUILTIN_ACC_SINGLE_COPY_START],
+		    false);
+
+  set_builtin_decl (BUILT_IN_GOACC_SINGLE_COPY_END,
+		    gcn_builtin_decls[GCN_BUILTIN_ACC_SINGLE_COPY_END],
+		    false);
+
+  set_builtin_decl (BUILT_IN_GOACC_BARRIER,
+		    gcn_builtin_decls[GCN_BUILTIN_ACC_BARRIER], false);
+#endif
+}
+
+/* Expand the CMP_SWAP GCN builtins.  We have our own versions that do
+   not require taking the address of any object, other than the memory
+   cell being operated on.
+
+   Helper function for gcn_expand_builtin_1.  */
+
+static rtx
+gcn_expand_cmp_swap (tree exp, rtx target)
+{
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  addr_space_t as
+    = TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (CALL_EXPR_ARG (exp, 0))));
+  machine_mode as_mode = gcn_addr_space_address_mode (as);
+
+  if (!target)
+    target = gen_reg_rtx (mode);
+
+  rtx addr = expand_expr (CALL_EXPR_ARG (exp, 0),
+			  NULL_RTX, as_mode, EXPAND_NORMAL);
+  rtx cmp = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 2),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx pat;
+
+  rtx mem = gen_rtx_MEM (mode, force_reg (as_mode, addr));
+  set_mem_addr_space (mem, as);
+
+  if (!REG_P (cmp))
+    cmp = copy_to_mode_reg (mode, cmp);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+
+  if (mode == SImode)
+    pat = gen_sync_compare_and_swapsi (target, mem, cmp, src);
+  else
+    pat = gen_sync_compare_and_swapdi (target, mem, cmp, src);
+
+  emit_insn (pat);
+
+  return target;
+}
+
+/* Expand many different builtins.
+
+   Intended for use in gcn-builtins.def.  */
+
+static rtx
+gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
+		      machine_mode /*mode */ , int ignore,
+		      struct gcn_builtin_description *)
+{
+  tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    case GCN_BUILTIN_FLAT_LOAD_INT32:
+      {
+	if (ignore)
+	  return target;
+	/*rtx exec = */
+	force_reg (DImode,
+		   expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX, DImode,
+				EXPAND_NORMAL));
+	/*rtx ptr = */
+	force_reg (V64DImode,
+		   expand_expr (CALL_EXPR_ARG (exp, 1), NULL_RTX, V64DImode,
+				EXPAND_NORMAL));
+	/*emit_insn (gen_vector_flat_loadv64si
+		     (target, gcn_gen_undef (V64SImode), ptr, exec)); */
+	return target;
+      }
+    case GCN_BUILTIN_FLAT_LOAD_PTR_INT32:
+    case GCN_BUILTIN_FLAT_LOAD_PTR_FLOAT:
+      {
+	if (ignore)
+	  return target;
+	rtx exec = force_reg (DImode,
+			      expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX,
+					   DImode,
+					   EXPAND_NORMAL));
+	rtx ptr = force_reg (DImode,
+			     expand_expr (CALL_EXPR_ARG (exp, 1), NULL_RTX,
+					  V64DImode,
+					  EXPAND_NORMAL));
+	rtx offsets = force_reg (V64SImode,
+				 expand_expr (CALL_EXPR_ARG (exp, 2),
+					      NULL_RTX, V64DImode,
+					      EXPAND_NORMAL));
+	rtx addrs = gen_reg_rtx (V64DImode);
+	rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_ashlv64si3_vector (tmp, offsets,
+					  GEN_INT (2),
+					  exec, gcn_gen_undef (V64SImode)));
+	emit_insn (gen_addv64di3_zext_dup2 (addrs, tmp, ptr, exec,
+					    gcn_gen_undef (V64DImode)));
+	rtx mem = gen_rtx_MEM (GET_MODE (target), addrs);
+	/*set_mem_addr_space (mem, ADDR_SPACE_FLAT); */
+	/* FIXME: set attributes.  */
+	emit_insn (gen_mov_with_exec (target, mem, exec));
+	return target;
+      }
+    case GCN_BUILTIN_FLAT_STORE_PTR_INT32:
+    case GCN_BUILTIN_FLAT_STORE_PTR_FLOAT:
+      {
+	rtx exec = force_reg (DImode,
+			      expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX,
+					   DImode,
+					   EXPAND_NORMAL));
+	rtx ptr = force_reg (DImode,
+			     expand_expr (CALL_EXPR_ARG (exp, 1), NULL_RTX,
+					  V64DImode,
+					  EXPAND_NORMAL));
+	rtx offsets = force_reg (V64SImode,
+				 expand_expr (CALL_EXPR_ARG (exp, 2),
+					      NULL_RTX, V64DImode,
+					      EXPAND_NORMAL));
+	machine_mode vmode = TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (exp, 3)));
+	rtx val = force_reg (vmode,
+			     expand_expr (CALL_EXPR_ARG (exp, 3), NULL_RTX,
+					  vmode,
+					  EXPAND_NORMAL));
+	rtx addrs = gen_reg_rtx (V64DImode);
+	rtx tmp = gen_reg_rtx (V64SImode);
+	emit_insn (gen_ashlv64si3_vector (tmp, offsets,
+					  GEN_INT (2),
+					  exec, gcn_gen_undef (V64SImode)));
+	emit_insn (gen_addv64di3_zext_dup2 (addrs, tmp, ptr, exec,
+					    gcn_gen_undef (V64DImode)));
+	rtx mem = gen_rtx_MEM (vmode, addrs);
+	/*set_mem_addr_space (mem, ADDR_SPACE_FLAT); */
+	/* FIXME: set attributes.  */
+	emit_insn (gen_mov_with_exec (mem, val, exec));
+	return target;
+      }
+    case GCN_BUILTIN_SQRTVF:
+      {
+	if (ignore)
+	  return target;
+	rtx exec = gcn_full_exec_reg ();
+	rtx arg = force_reg (V64SFmode,
+			     expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX,
+					  V64SFmode,
+					  EXPAND_NORMAL));
+	emit_insn (gen_sqrtv64sf_vector
+		   (target, arg, exec, gcn_gen_undef (V64SFmode)));
+	return target;
+      }
+    case GCN_BUILTIN_SQRTF:
+      {
+	if (ignore)
+	  return target;
+	rtx exec = gcn_scalar_exec ();
+	rtx arg = force_reg (SFmode,
+			     expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX,
+					  SFmode,
+					  EXPAND_NORMAL));
+	emit_insn (gen_sqrtsf_scalar (target, arg, exec));
+	return target;
+      }
+    case GCN_BUILTIN_OMP_DIM_SIZE:
+      {
+	if (ignore)
+	  return target;
+	emit_insn (gen_oacc_dim_size (target,
+				      expand_expr (CALL_EXPR_ARG (exp, 0),
+						   NULL_RTX, SImode,
+						   EXPAND_NORMAL)));
+	return target;
+      }
+    case GCN_BUILTIN_OMP_DIM_POS:
+      {
+	if (ignore)
+	  return target;
+	emit_insn (gen_oacc_dim_pos (target,
+				     expand_expr (CALL_EXPR_ARG (exp, 0),
+						  NULL_RTX, SImode,
+						  EXPAND_NORMAL)));
+	return target;
+      }
+    case GCN_BUILTIN_CMP_SWAP:
+    case GCN_BUILTIN_CMP_SWAPLL:
+      return gcn_expand_cmp_swap (exp, target);
+
+    case GCN_BUILTIN_ACC_SINGLE_START:
+      {
+	if (ignore)
+	  return target;
+
+	rtx wavefront = gcn_oacc_dim_pos (1);
+	rtx cond = gen_rtx_EQ (VOIDmode, wavefront, const0_rtx);
+	rtx cc = (target && REG_P (target)) ? target : gen_reg_rtx (BImode);
+	emit_insn (gen_cstoresi4 (cc, cond, wavefront, const0_rtx));
+	return cc;
+      }
+
+    case GCN_BUILTIN_ACC_SINGLE_COPY_START:
+      {
+	rtx blk = force_reg (SImode,
+			     expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX,
+					  SImode, EXPAND_NORMAL));
+	rtx wavefront = gcn_oacc_dim_pos (1);
+	rtx cond = gen_rtx_NE (VOIDmode, wavefront, const0_rtx);
+	rtx not_zero = gen_label_rtx ();
+	emit_insn (gen_cbranchsi4 (cond, wavefront, const0_rtx, not_zero));
+	emit_move_insn (blk, const0_rtx);
+	emit_label (not_zero);
+	return blk;
+      }
+
+    case GCN_BUILTIN_ACC_SINGLE_COPY_END:
+      return target;
+
+    case GCN_BUILTIN_ACC_BARRIER:
+      emit_insn (gen_gcn_wavefront_barrier ());
+      return target;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Expansion of simple arithmetic and bit binary operation builtins.
+
+   Intended for use with gcn_builtins table.  */
+
+static rtx
+gcn_expand_builtin_binop (tree exp, rtx target, rtx /*subtarget */ ,
+			  machine_mode /*mode */ , int ignore,
+			  struct gcn_builtin_description *d)
+{
+  int icode = d->icode;
+  if (ignore)
+    return target;
+
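+  /* For these binop builtins the first call argument is the EXEC mask to
+     apply to the operation (see DEF_BUILTIN_BINOP_INT_FP above).  */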
+  rtx exec = force_reg (DImode,
+			expand_expr (CALL_EXPR_ARG (exp, 0), NULL_RTX, DImode,
+				     EXPAND_NORMAL));
+
+  machine_mode m1 = insn_data[icode].operand[1].mode;
+  rtx arg1 = expand_expr (CALL_EXPR_ARG (exp, 1), NULL_RTX, m1,
+			  EXPAND_NORMAL);
+  if (!insn_data[icode].operand[1].predicate (arg1, m1))
+    arg1 = force_reg (m1, arg1);
+
+  machine_mode m2 = insn_data[icode].operand[2].mode;
+  rtx arg2 = expand_expr (CALL_EXPR_ARG (exp, 2), NULL_RTX, m2,
+			  EXPAND_NORMAL);
+  if (!insn_data[icode].operand[2].predicate (arg2, m2))
+    arg2 = force_reg (m2, arg2);
+
+  rtx arg_prev;
+  if (call_expr_nargs (exp) == 4)
+    {
+      machine_mode m_prev = insn_data[icode].operand[4].mode;
+      arg_prev = force_reg (m_prev,
+			    expand_expr (CALL_EXPR_ARG (exp, 3), NULL_RTX,
+					 m_prev, EXPAND_NORMAL));
+    }
+  else
+    arg_prev = gcn_gen_undef (GET_MODE (target));
+
+  rtx pat = GEN_FCN (icode) (target, arg1, arg2, exec, arg_prev);
+  emit_insn (pat);
+  return target;
+}
+
+/* Implement TARGET_EXPAND_BUILTIN.
+
+   Expand an expression EXP that calls a built-in function, with result going
+   to TARGET if that's convenient (and in mode MODE if that's convenient).
+   SUBTARGET may be used as the target for computing one of EXP's operands.
+   IGNORE is nonzero if the value is to be ignored.  */
+
+rtx
+gcn_expand_builtin (tree exp, rtx target, rtx subtarget, machine_mode mode,
+		    int ignore)
+{
+  tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
+  unsigned int fcode = DECL_FUNCTION_CODE (fndecl);
+  struct gcn_builtin_description *d;
+
+  gcc_assert (fcode < GCN_BUILTIN_MAX);
+  d = &gcn_builtins[fcode];
+
+  if (d->type == B_UNIMPLEMENTED)
+    sorry ("Builtin not implemented");
+
+  return d->expander (exp, target, subtarget, mode, ignore, d);
+}
+
+/* }}}  */
+/* {{{ Vectorization.  */
+
+/* Implement TARGET_VECTORIZE_GET_MASK_MODE.
+
+   The autovectorizer requires a vector type for the "truth vector" used for
+   fully masked loops, etc.  Therefore we must have minimal support for V64BI
+   mode (enough to enable the middle-end optimizations).
+
+   We convert V64BImode to DImode at expand time.  */
+
+opt_machine_mode
+gcn_vectorize_get_mask_mode (poly_uint64 ARG_UNUSED (nunits),
+			     poly_uint64 ARG_UNUSED (length))
+{
+  return V64BImode;
+}
+
+/* Convert vector mask mode.
+
+   The autovectorizer uses a V64BImode mask, but the backend uses DImode.
+   This is intended to convert the mask mode during the expand pass only.  */
+
+rtx
+gcn_convert_mask_mode (rtx x)
+{
+  gcc_assert (GET_MODE (x) == V64BImode);
+
+  if (REG_P (x) || SUBREG_P (x))
+    return simplify_gen_subreg (DImode, x, V64BImode, 0);
+  else if (GET_CODE (x) == CONST_VECTOR)
+    {
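+      /* Pack the 64 boolean elements into a scalar bitmask, one bit per
+	 lane; e.g. a CONST_VECTOR with only lane 0 set becomes const1_rtx.  */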
+      HOST_WIDE_INT mask = 0;
+      for (int i = 0; i < 64; i++)
+	mask |= (INTVAL (CONST_VECTOR_ELT (x, i))
+		 ? HOST_WIDE_INT_1U << i : 0);
+
+      return GEN_INT (mask);
+    }
+  else if (MEM_P (x))
+    {
+      rtx copy = shallow_copy_rtx (x);
+      PUT_MODE (copy, DImode);
+      return copy;
+    }
+  else
+    {
+      gcc_unreachable ();
+      return x;
+    }
+}
+
+/* Return an RTX that references a vector with the i-th lane containing
+   PERM[i]*4.
+
+   Helper function for gcn_vectorize_vec_perm_const.  */
+
+static rtx
+gcn_make_vec_perm_address (unsigned int *perm)
+{
+  rtx x = gen_reg_rtx (V64SImode);
+  emit_insn (gen_mov_with_exec (x, gcn_vec_constant (V64SImode, 0)));
+
+  /* Permutation addresses use byte addressing.  With each vector lane being
+     4 bytes wide, and with 64 lanes in total, only bits 2..7 are significant,
+     so only set those.
+
+     The permutation indices given to the vec_perm* patterns range from
+     0 to 2N-1 to select between lanes in two vectors, but as the
+     DS_BPERMUTE* instructions only take one source vector, the
+     most-significant bit can be ignored
+     here.  Instead, we can use EXEC masking to select the relevant part of
+     each source vector after they are permuted separately.  */
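+  /* For example, if perm[j] == 5 then the byte address is 5*4 == 20
+     (binary 10100): bits 2 and 4 are set, so lane j is enabled in the EXEC
+     mask of the additions for exactly those two iterations below.  */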
+  uint64_t bit_mask = 1 << 2;
+  for (int i = 2; i < 8; i++, bit_mask <<= 1)
+    {
+      uint64_t exec_mask = 0;
+      uint64_t lane_mask = 1;
+      for (int j = 0; j < 64; j++, lane_mask <<= 1)
+	if ((perm[j] * 4) & bit_mask)
+	  exec_mask |= lane_mask;
+
+      if (exec_mask)
+	emit_insn (gen_addv64si3_vector (x, x,
+					 gcn_vec_constant (V64SImode,
+							   bit_mask),
+					 get_exec (exec_mask), x));
+    }
+
+  return x;
+}
+
+/* Implement TARGET_VECTORIZE_VEC_PERM_CONST.
+
+   Return true if permutation with SEL is possible.
+
+   If DST/SRC0/SRC1 are non-null, emit the instructions to perform the
+   permutations.  */
+
+static bool
+gcn_vectorize_vec_perm_const (machine_mode vmode, rtx dst,
+			      rtx src0, rtx src1,
+			      const vec_perm_indices & sel)
+{
+  unsigned int nelt = GET_MODE_NUNITS (vmode);
+
+  gcc_assert (VECTOR_MODE_P (vmode));
+  gcc_assert (nelt <= 64);
+  gcc_assert (sel.length () == nelt);
+
+  if (vmode == V64BImode)
+    {
+      /* This isn't a true vector, it's a bitmask.  */
+      return false;
+    }
+
+  if (!dst)
+    {
+      /* All vector permutations are possible on this architecture,
+         with varying degrees of efficiency depending on the permutation.  */
+      return true;
+    }
+
+  unsigned int perm[64];
+  for (unsigned int i = 0; i < nelt; ++i)
+    perm[i] = sel[i] & (2 * nelt - 1);
+
+  /* Make life a bit easier by swapping operands if necessary so that
+     the first element always comes from src0.  */
+  if (perm[0] >= nelt)
+    {
+      rtx temp = src0;
+      src0 = src1;
+      src1 = temp;
+
+      for (unsigned int i = 0; i < nelt; ++i)
+	if (perm[i] < nelt)
+	  perm[i] += nelt;
+	else
+	  perm[i] -= nelt;
+    }
+
+  /* TODO: There are more efficient ways to implement certain permutations
+     using ds_swizzle_b32 and/or DPP.  Test for and expand them here, before
+     this more inefficient generic approach is used.  */
+
+  int64_t src1_lanes = 0;
+  int64_t lane_bit = 1;
+
+  for (unsigned int i = 0; i < nelt; ++i, lane_bit <<= 1)
+    {
+      /* Set the bits for lanes from src1.  */
+      if (perm[i] >= nelt)
+	src1_lanes |= lane_bit;
+    }
+
+  rtx addr = gcn_make_vec_perm_address (perm);
+  rtx (*ds_bpermute) (rtx, rtx, rtx, rtx);
+
+  switch (vmode)
+    {
+    case E_V64QImode:
+      ds_bpermute = gen_ds_bpermutev64qi;
+      break;
+    case E_V64HImode:
+      ds_bpermute = gen_ds_bpermutev64hi;
+      break;
+    case E_V64SImode:
+      ds_bpermute = gen_ds_bpermutev64si;
+      break;
+    case E_V64HFmode:
+      ds_bpermute = gen_ds_bpermutev64hf;
+      break;
+    case E_V64SFmode:
+      ds_bpermute = gen_ds_bpermutev64sf;
+      break;
+    case E_V64DImode:
+      ds_bpermute = gen_ds_bpermutev64di;
+      break;
+    case E_V64DFmode:
+      ds_bpermute = gen_ds_bpermutev64df;
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  /* Load elements from src0 to dst.  */
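+  /* After the operand swap above, lane 0 always comes from src0, so
+     src1_lanes can never be all-ones.  */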
+  gcc_assert (~src1_lanes);
+  emit_insn (ds_bpermute (dst, addr, src0, gcn_full_exec_reg ()));
+
+  /* Load elements from src1 to dst.  */
+  if (src1_lanes)
+    {
+      /* Masking a lane masks both the destination and source lanes for
+         DS_BPERMUTE, so we need to have all lanes enabled for the permute,
+         then add an extra masked move to merge the results of permuting
+         the two source vectors together.  */
+      rtx tmp = gen_reg_rtx (vmode);
+      emit_insn (ds_bpermute (tmp, addr, src1, gcn_full_exec_reg ()));
+      emit_insn (gen_mov_with_exec (dst, tmp, get_exec (src1_lanes)));
+    }
+
+  return true;
+}
+
+/* Implements TARGET_VECTOR_MODE_SUPPORTED_P.
+ 
+   Return nonzero if vector MODE is supported with at least move
+   instructions.  */
+
+static bool
+gcn_vector_mode_supported_p (machine_mode mode)
+{
+  /* FIXME: Enable V64QImode and V64HImode.
+	    We should support these modes, but vector operations are usually
+	    assumed to automatically truncate types, and GCN does not.  We
+	    need to add explicit truncates and/or use SDWA for QI/HI insns.  */
+  return (/* mode == V64QImode || mode == V64HImode
+	  ||*/ mode == V64SImode || mode == V64DImode
+	  || mode == V64SFmode || mode == V64DFmode
+	  /* For the mask mode only.  */
+	  || mode == V64BImode);
+}
+
+/* Implement TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
+
+   Enables autovectorization for all supported modes.  */
+
+static machine_mode
+gcn_vectorize_preferred_simd_mode (scalar_mode mode)
+{
+  switch (mode)
+    {
+    case E_QImode:
+      return V64QImode;
+    case E_HImode:
+      return V64HImode;
+    case E_SImode:
+      return V64SImode;
+    case E_DImode:
+      return V64DImode;
+    case E_SFmode:
+      return V64SFmode;
+    case E_DFmode:
+      return V64DFmode;
+    default:
+      return word_mode;
+    }
+}
+
+/* Implement TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT.
+
+   Return true if the target supports misaligned vector store/load of a
+   specific factor denoted in the misalignment parameter.  */
+
+static bool
+gcn_vectorize_support_vector_misalignment (machine_mode ARG_UNUSED (mode),
+					   const_tree type, int misalignment,
+					   bool is_packed)
+{
+  if (is_packed)
+    return false;
+
+  /* If the misalignment is unknown, we should be able to handle the access
+     so long as it is not to a member of a packed data structure.  */
+  if (misalignment == -1)
+    return true;
+
+  /* Return true if the misalignment is a multiple of the natural alignment
+     of the vector's element type.  This is probably always going to be
+     true in practice, since we've already established that this isn't a
+     packed access.  */
+  return misalignment % TYPE_ALIGN_UNIT (type) == 0;
+}
+
+/* Implement TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE.
+
+   Return true if vector alignment is reachable (by peeling N iterations) for
+   the given scalar type TYPE.  */
+
+static bool
+gcn_vector_alignment_reachable (const_tree ARG_UNUSED (type), bool is_packed)
+{
+  /* Vectors which aren't in packed structures will not be less aligned than
+     the natural alignment of their element type, so this is safe.  */
+  return !is_packed;
+}
+
+/* Generate DPP instructions used for vector reductions.
+
+   The opcode is given by INSN.
+   The first operand of the operation is shifted right by SHIFT vector lanes.
+   SHIFT must be a power of 2.  If SHIFT is 16, the 15th lane of each row is
+   broadcast to the next row (thereby acting like a shift of 16 for the end
+   of each row).  If SHIFT is 32, lane 31 is broadcast to all the
+   following lanes (thereby acting like a shift of 32 for lane 63).  */
+
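+/* For example (illustrative only), INSN "v_max_i32" with SHIFT 4 yields
+   the template "v_max_i32  %0, %1, %2 row_shr:4 bank_mask:0xe"; the
+   plus-carry variants additionally insert ", vcc" operands.  */
+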
+char *
+gcn_expand_dpp_shr_insn (machine_mode mode, const char *insn,
+			 int unspec, int shift)
+{
+  static char buf[64];
+  const char *dpp;
+  const char *vcc_in = "";
+  const char *vcc_out = "";
+
+  /* Add the vcc operand if needed.  */
+  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+    {
+      if (unspec == UNSPEC_PLUS_CARRY_IN_DPP_SHR)
+	vcc_in = ", vcc";
+
+      if (unspec == UNSPEC_PLUS_CARRY_DPP_SHR
+	  || unspec == UNSPEC_PLUS_CARRY_IN_DPP_SHR)
+	vcc_out = ", vcc";
+    }
+
+  /* Add the DPP modifiers.  */
+  switch (shift)
+    {
+    case 1:
+      dpp = "row_shr:1 bound_ctrl:0";
+      break;
+    case 2:
+      dpp = "row_shr:2 bound_ctrl:0";
+      break;
+    case 4:
+      dpp = "row_shr:4 bank_mask:0xe";
+      break;
+    case 8:
+      dpp = "row_shr:8 bank_mask:0xc";
+      break;
+    case 16:
+      dpp = "row_bcast:15 row_mask:0xa";
+      break;
+    case 32:
+      dpp = "row_bcast:31 row_mask:0xc";
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  sprintf (buf, "%s\t%%0%s, %%1, %%2%s %s", insn, vcc_out, vcc_in, dpp);
+
+  return buf;
+}
+
+/* Generate vector reductions in terms of DPP instructions.
+
+   The vector register SRC of mode MODE is reduced using the operation given
+   by UNSPEC, and the scalar result is returned in lane 63 of a vector
+   register.  */
+
+rtx
+gcn_expand_reduc_scalar (machine_mode mode, rtx src, int unspec)
+{
+  rtx tmp = gen_reg_rtx (mode);
+  bool use_plus_carry = unspec == UNSPEC_PLUS_DPP_SHR
+			&& GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+			&& (TARGET_GCN3 || mode == V64DImode);
+
+  if (use_plus_carry)
+    unspec = UNSPEC_PLUS_CARRY_DPP_SHR;
+
+  /* Perform reduction by first performing the reduction operation on every
+     pair of lanes, then on every pair of results from the previous
+     iteration (thereby effectively reducing every 4 lanes) and so on until
+     all lanes are reduced.  */
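+  /* For a plus reduction, for example: after the SHIFT==1 step each lane
+     holds the sum of a pair of lanes, after SHIFT==2 a group of four, and
+     so on, until lane 63 holds the reduction of all 64 lanes.  */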
+  for (int i = 0, shift = 1; i < 6; i++, shift <<= 1)
+    {
+      rtx shift_val = GEN_INT (shift);
+      rtx insn = gen_rtx_SET (tmp,
+			      gen_rtx_UNSPEC (mode,
+					      gen_rtvec (3,
+							 src, src, shift_val),
+					      unspec));
+
+      /* Add clobber for instructions that set the carry flags.  */
+      if (use_plus_carry)
+	{
+	  rtx clobber = gen_rtx_CLOBBER (VOIDmode,
+					 gen_rtx_REG (DImode, VCC_REG));
+	  insn = gen_rtx_PARALLEL (VOIDmode,
+				   gen_rtvec (2, insn, clobber));
+	}
+
+      emit_insn (insn);
+
+      /* The source operands for every iteration after the first
+	   should be TMP.  */
+      src = tmp;
+    }
+
+  return tmp;
+}
+
+/* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
+
+int
+gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
+			tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
+{
+  /* Always vectorize.  */
+  return 1;
+}
+
+/* }}}  */
+/* {{{ md_reorg pass.  */
+
+/* Identify VMEM instructions from their "type" attribute.  */
+
+static bool
+gcn_vmem_insn_p (attr_type type)
+{
+  switch (type)
+    {
+    case TYPE_MUBUF:
+    case TYPE_MTBUF:
+    case TYPE_FLAT:
+      return true;
+    case TYPE_UNKNOWN:
+    case TYPE_SOP1:
+    case TYPE_SOP2:
+    case TYPE_SOPK:
+    case TYPE_SOPC:
+    case TYPE_SOPP:
+    case TYPE_SMEM:
+    case TYPE_DS:
+    case TYPE_VOP2:
+    case TYPE_VOP1:
+    case TYPE_VOPC:
+    case TYPE_VOP3A:
+    case TYPE_VOP3B:
+    case TYPE_VOP_SDWA:
+    case TYPE_VOP_DPP:
+    case TYPE_MULT:
+    case TYPE_VMULT:
+      return false;
+    }
+  gcc_unreachable ();
+  return false;
+}
+
+/* If INSN sets the EXEC register to a constant value, return the value,
+   otherwise return zero.  */
+
+static int64_t
+gcn_insn_exec_value (rtx_insn *insn)
+{
+  if (!NONDEBUG_INSN_P (insn))
+    return 0;
+
+  rtx pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == SET)
+    {
+      rtx dest = XEXP (pattern, 0);
+      rtx src = XEXP (pattern, 1);
+
+      if (GET_MODE (dest) == DImode
+	  && REG_P (dest) && REGNO (dest) == EXEC_REG
+	  && CONST_INT_P (src))
+	return INTVAL (src);
+    }
+
+  return 0;
+}
+
+/* Sets the EXEC register before INSN to the value that it had after
+   LAST_EXEC_DEF.  The constant value of the EXEC register is returned if
+   known, otherwise it returns zero.  */
+
+static int64_t
+gcn_restore_exec (rtx_insn *insn, rtx_insn *last_exec_def,
+		  int64_t curr_exec, bool curr_exec_known,
+		  bool &last_exec_def_saved)
+{
+  rtx exec_reg = gen_rtx_REG (DImode, EXEC_REG);
+  rtx exec;
+
+  int64_t exec_value = gcn_insn_exec_value (last_exec_def);
+
+  if (exec_value)
+    {
+      /* If the EXEC value is a constant and it happens to be the same as the
+         current EXEC value, the restore can be skipped.  */
+      if (curr_exec_known && exec_value == curr_exec)
+	return exec_value;
+
+      exec = GEN_INT (exec_value);
+    }
+  else
+    {
+      /* If the EXEC value is not a constant, save it in a register after the
+	 point of definition.  */
+      rtx exec_save_reg = gen_rtx_REG (DImode, EXEC_SAVE_REG);
+
+      if (!last_exec_def_saved)
+	{
+	  start_sequence ();
+	  emit_insn (gen_move_insn (exec_save_reg, exec_reg));
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  emit_insn_after (seq, last_exec_def);
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Saving EXEC after insn %d.\n",
+		     INSN_UID (last_exec_def));
+
+	  last_exec_def_saved = true;
+	}
+
+      exec = exec_save_reg;
+    }
+
+  /* Restore EXEC register before the usage.  */
+  start_sequence ();
+  emit_insn (gen_move_insn (exec_reg, exec));
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+  emit_insn_before (seq, insn);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      if (exec_value)
+	fprintf (dump_file,
+		 "Restoring EXEC to %" PRId64 " before insn %d.\n",
+		 exec_value, INSN_UID (insn));
+      else
+	fprintf (dump_file,
+		 "Restoring EXEC from saved value before insn %d.\n",
+		 INSN_UID (insn));
+    }
+
+  return exec_value;
+}
+
+/* Implement TARGET_MACHINE_DEPENDENT_REORG.
+
+   Ensure that pipeline dependencies and lane masking are set correctly.  */
+
+static void
+gcn_md_reorg (void)
+{
+  basic_block bb;
+  rtx exec_reg = gen_rtx_REG (DImode, EXEC_REG);
+  rtx exec_lo_reg = gen_rtx_REG (SImode, EXEC_LO_REG);
+  rtx exec_hi_reg = gen_rtx_REG (SImode, EXEC_HI_REG);
+  regset_head live;
+
+  INIT_REG_SET (&live);
+
+  compute_bb_for_insn ();
+
+  if (!optimize)
+    {
+      split_all_insns ();
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "After split:\n");
+	  print_rtl_with_bb (dump_file, get_insns (), dump_flags);
+	}
+
+      /* Update data-flow information for split instructions.  */
+      df_insn_rescan_all ();
+    }
+
+  df_analyze ();
+
+  /* This pass ensures that the EXEC register is set correctly, according
+     to the "exec" attribute.  However, care must be taken so that the
+     value that reaches explicit uses of the EXEC register remains the
+     same as before.  */
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "BB %d:\n", bb->index);
+
+      rtx_insn *insn, *curr;
+      rtx_insn *last_exec_def = BB_HEAD (bb);
+      bool last_exec_def_saved = false;
+      bool curr_exec_explicit = true;
+      bool curr_exec_known = true;
+      int64_t curr_exec = 0;	/* 0 here means 'the value is that of EXEC
+				   after last_exec_def is executed'.  */
+
+      FOR_BB_INSNS_SAFE (bb, insn, curr)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  if (GET_CODE (PATTERN (insn)) == USE
+	      || GET_CODE (PATTERN (insn)) == CLOBBER)
+	    continue;
+
+	  /* Check the instruction for implicit setting of EXEC via an
+	     attribute.  */
+	  attr_exec exec_attr = get_attr_exec (insn);
+	  int64_t new_exec;
+
+	  switch (exec_attr)
+	    {
+	    case EXEC_SINGLE:
+	      /* Memory access instructions must execute with only lane 0
+		 enabled, so EXEC is set to exactly 1 for them; other
+		 instructions merely require bit 0 to be set, so the
+		 remaining lanes can be left enabled.  */
+	      if (gcn_vmem_insn_p (get_attr_type (insn))
+		  || get_attr_type (insn) == TYPE_DS)
+		new_exec = 1;
+	      else
+		new_exec = curr_exec | 1;
+	      break;
+
+	    case EXEC_FULL:
+	      new_exec = -1;
+	      break;
+
+	    default:
+	      new_exec = 0;
+	      break;
+	    }
+
+	  if (new_exec && (!curr_exec_known || new_exec != curr_exec))
+	    {
+	      start_sequence ();
+	      emit_insn (gen_move_insn (exec_reg, GEN_INT (new_exec)));
+	      rtx_insn *seq = get_insns ();
+	      end_sequence ();
+	      emit_insn_before (seq, insn);
+
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file,
+			 "Setting EXEC to %" PRId64 " before insn %d.\n",
+			 new_exec, INSN_UID (insn));
+
+	      curr_exec = new_exec;
+	      curr_exec_explicit = false;
+	      curr_exec_known = true;
+	    }
+
+	  /* The state of the EXEC register is unknown after a
+	     function call.  */
+	  if (CALL_P (insn))
+	    curr_exec_known = false;
+
+	  bool exec_lo_def_p = reg_set_p (exec_lo_reg, PATTERN (insn));
+	  bool exec_hi_def_p = reg_set_p (exec_hi_reg, PATTERN (insn));
+	  bool exec_used = reg_referenced_p (exec_reg, PATTERN (insn));
+
+	  /* Handle explicit uses of EXEC.  If the instruction is a partial
+	     explicit definition of EXEC, then treat it as an explicit use of
+	     EXEC as well.  */
+	  if (exec_used || exec_lo_def_p != exec_hi_def_p)
+	    {
+	      /* An instruction that explicitly uses EXEC should not also
+		 implicitly define it.  */
+	      gcc_assert (!exec_used || !new_exec);
+
+	      if (!curr_exec_known || !curr_exec_explicit)
+		{
+		  /* Restore the previous explicitly defined value.  */
+		  curr_exec = gcn_restore_exec (insn, last_exec_def,
+						curr_exec, curr_exec_known,
+						last_exec_def_saved);
+		  curr_exec_explicit = true;
+		  curr_exec_known = true;
+		}
+	    }
+
+	  /* Handle explicit definitions of EXEC.  */
+	  if (exec_lo_def_p || exec_hi_def_p)
+	    {
+	      last_exec_def = insn;
+	      last_exec_def_saved = false;
+	      curr_exec = gcn_insn_exec_value (insn);
+	      curr_exec_explicit = true;
+	      curr_exec_known = true;
+
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file,
+			 "Found %s definition of EXEC at insn %d.\n",
+			 exec_lo_def_p == exec_hi_def_p ? "full" : "partial",
+			 INSN_UID (insn));
+	    }
+	}
+
+      COPY_REG_SET (&live, DF_LR_OUT (bb));
+      df_simulate_initialize_backwards (bb, &live);
+
+      /* If EXEC is live after the basic block, restore the value of EXEC
+	 at the end of the block.  */
+      if ((REGNO_REG_SET_P (&live, EXEC_LO_REG)
+	   || REGNO_REG_SET_P (&live, EXEC_HI_REG))
+	  && (!curr_exec_known || !curr_exec_explicit))
+	{
+	  rtx_insn *end_insn = BB_END (bb);
+
+	  /* If the instruction is not a jump instruction, do the restore
+	     after the last instruction in the basic block.  */
+	  if (NONJUMP_INSN_P (end_insn))
+	    end_insn = NEXT_INSN (end_insn);
+
+	  gcn_restore_exec (end_insn, last_exec_def, curr_exec,
+			    curr_exec_known, last_exec_def_saved);
+	}
+    }
+
+  CLEAR_REG_SET (&live);
+
+  /* "Manually Inserted Wait States (NOPs)."
+
+     GCN hardware detects most kinds of register dependencies, but there
+     are some exceptions documented in the ISA manual.  This pass
+     detects the missed cases, and inserts the documented number of NOPs
+     required for correct execution.  */
+
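+  /* Ring buffer of the most recently seen instructions, the hard registers
+     they wrote, and their age in wait states; OLDEST indexes the slot that
+     will be recycled next.  */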
+  const int max_waits = 5;
+  struct ilist
+  {
+    rtx_insn *insn;
+    attr_unit unit;
+    HARD_REG_SET writes;
+    int age;
+  } back[max_waits];
+  int oldest = 0;
+  for (int i = 0; i < max_waits; i++)
+    back[i].insn = NULL;
+
+  rtx_insn *insn, *last_insn = NULL;
+  for (insn = get_insns (); insn != 0; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      if (GET_CODE (PATTERN (insn)) == USE
+	  || GET_CODE (PATTERN (insn)) == CLOBBER)
+	continue;
+
+      attr_type itype = get_attr_type (insn);
+      attr_unit iunit = get_attr_unit (insn);
+      HARD_REG_SET ireads, iwrites;
+      CLEAR_HARD_REG_SET (ireads);
+      CLEAR_HARD_REG_SET (iwrites);
+      note_stores (PATTERN (insn), record_hard_reg_sets, &iwrites);
+      note_uses (&PATTERN (insn), record_hard_reg_uses, &ireads);
+
+      /* Scan recent previous instructions for dependencies not handled in
+         hardware.  */
+      int nops_rqd = 0;
+      for (int i = oldest; i < oldest + max_waits; i++)
+	{
+	  struct ilist *prev_insn = &back[i % max_waits];
+
+	  if (!prev_insn->insn)
+	    continue;
+
+	  /* VALU writes SGPR followed by VMEM reading the same SGPR
+	     requires 5 wait states.  */
+	  if ((prev_insn->age + nops_rqd) < 5
+	      && prev_insn->unit == UNIT_VECTOR
+	      && gcn_vmem_insn_p (itype))
+	    {
+	      HARD_REG_SET regs;
+	      COPY_HARD_REG_SET (regs, prev_insn->writes);
+	      AND_HARD_REG_SET (regs, ireads);
+	      if (hard_reg_set_intersect_p
+		  (regs, reg_class_contents[(int) SGPR_REGS]))
+		nops_rqd = 5 - prev_insn->age;
+	    }
+
+	  /* VALU sets VCC/EXEC followed by VALU uses VCCZ/EXECZ
+	     requires 5 wait states.  */
+	  if ((prev_insn->age + nops_rqd) < 5
+	      && prev_insn->unit == UNIT_VECTOR
+	      && iunit == UNIT_VECTOR
+	      && ((hard_reg_set_intersect_p
+		   (prev_insn->writes,
+		    reg_class_contents[(int) EXEC_MASK_REG])
+		   && TEST_HARD_REG_BIT (ireads, EXECZ_REG))
+		  ||
+		  (hard_reg_set_intersect_p
+		   (prev_insn->writes,
+		    reg_class_contents[(int) VCC_CONDITIONAL_REG])
+		   && TEST_HARD_REG_BIT (ireads, VCCZ_REG))))
+	    nops_rqd = 5 - prev_insn->age;
+
+	  /* VALU writes SGPR/VCC followed by v_{read,write}lane using
+	     SGPR/VCC as lane select requires 4 wait states.  */
+	  if ((prev_insn->age + nops_rqd) < 4
+	      && prev_insn->unit == UNIT_VECTOR
+	      && get_attr_laneselect (insn) == LANESELECT_YES)
+	    {
+	      HARD_REG_SET regs;
+	      COPY_HARD_REG_SET (regs, prev_insn->writes);
+	      AND_HARD_REG_SET (regs, ireads);
+	      if (hard_reg_set_intersect_p
+		  (regs, reg_class_contents[(int) SGPR_REGS])
+		  || hard_reg_set_intersect_p
+		     (regs, reg_class_contents[(int) VCC_CONDITIONAL_REG]))
+		nops_rqd = 4 - prev_insn->age;
+	    }
+
+	  /* VALU writes VGPR followed by VALU_DPP reading that VGPR
+	     requires 2 wait states.  */
+	  if ((prev_insn->age + nops_rqd) < 2
+	      && prev_insn->unit == UNIT_VECTOR
+	      && itype == TYPE_VOP_DPP)
+	    {
+	      HARD_REG_SET regs;
+	      COPY_HARD_REG_SET (regs, prev_insn->writes);
+	      AND_HARD_REG_SET (regs, ireads);
+	      if (hard_reg_set_intersect_p
+		  (regs, reg_class_contents[(int) VGPR_REGS]))
+		nops_rqd = 2 - prev_insn->age;
+	    }
+	}
+
+      /* Insert the required number of NOPs.  */
+      for (int i = nops_rqd; i > 0; i--)
+	emit_insn_after (gen_nop (), last_insn);
+
+      /* Age the previous instructions.  We can also ignore writes to
+         registers subsequently overwritten.  */
+      HARD_REG_SET written;
+      CLEAR_HARD_REG_SET (written);
+      for (int i = oldest + max_waits - 1; i > oldest; i--)
+	{
+	  struct ilist *prev_insn = &back[i % max_waits];
+
+	  /* Assume all instructions are equivalent to one "wait", the same
+	     as s_nop.  This is probably true for SALU, but not VALU (which
+	     may take longer), so this is not optimal.  However, AMD do
+	     not publish the cycle times for instructions.  */
+	  prev_insn->age += 1 + nops_rqd;
+
+	  IOR_HARD_REG_SET (written, iwrites);
+	  AND_COMPL_HARD_REG_SET (prev_insn->writes, written);
+	}
+
+      /* Track the current instruction as a previous instruction.  */
+      back[oldest].insn = insn;
+      back[oldest].unit = iunit;
+      COPY_HARD_REG_SET (back[oldest].writes, iwrites);
+      back[oldest].age = 0;
+      oldest = (oldest + 1) % max_waits;
+
+      last_insn = insn;
+    }
+}
+
+/* }}}  */
+/* {{{ OpenACC / OpenMP.  */
+
+#define GCN_DEFAULT_GANGS 0	/* Choose at runtime.  */
+#define GCN_DEFAULT_WORKERS 0	/* Choose at runtime.  */
+#define GCN_DEFAULT_VECTORS 1	/* Use autovectorization only, for now.  */
+
+/* Implement TARGET_GOACC_VALIDATE_DIMS.
+
+   Check the launch dimensions provided for an OpenACC compute
+   region, or routine.  */
+
+static bool
+gcn_goacc_validate_dims (tree decl, int dims[], int fn_level)
+{
+  bool changed = false;
+
+  /* FIXME: remove -facc-experimental-workers when they're ready.  */
+  int max_workers = flag_worker_partitioning ? 4 : 1;
+
+  /* The vector size must appear to be 64, to the user, unless this is a
+     SEQ routine.  The real, internal value is always 1, which means use
+     autovectorization, but the user should not see that.  */
+  if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
+      && dims[GOMP_DIM_VECTOR] >= 0)
+    {
+      if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0
+	  && dims[GOMP_DIM_VECTOR] != 64)
+	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		    OPT_Wopenacc_dims,
+		    (dims[GOMP_DIM_VECTOR]
+		     ? "using vector_length (64), ignoring %d"
+		     : "using vector_length (64), ignoring runtime setting"),
+		    dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = 1;
+      changed = true;
+    }
+
+  /* Check the num workers is not too large.  */
+  if (dims[GOMP_DIM_WORKER] > max_workers)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  OPT_Wopenacc_dims,
+		  "using num_workers (%d), ignoring %d",
+		  max_workers, dims[GOMP_DIM_WORKER]);
+      dims[GOMP_DIM_WORKER] = max_workers;
+      changed = true;
+    }
+
+  /* Set global defaults.  */
+  if (!decl)
+    {
+      dims[GOMP_DIM_VECTOR] = GCN_DEFAULT_VECTORS;
+      if (dims[GOMP_DIM_WORKER] < 0)
+	dims[GOMP_DIM_WORKER] = (flag_worker_partitioning
+				 ? GCN_DEFAULT_WORKERS : 1);
+      if (dims[GOMP_DIM_GANG] < 0)
+	dims[GOMP_DIM_GANG] = GCN_DEFAULT_GANGS;
+      changed = true;
+    }
+
+  return changed;
+}
+
+/* Helper function for oacc_dim_size instruction.
+   Also used for OpenMP, via builtin_gcn_dim_size, and the omp_gcn pass.  */
+
+rtx
+gcn_oacc_dim_size (int dim)
+{
+  if (dim < 0 || dim > 2)
+    error ("offload dimension out of range (%d)", dim);
+
+  /* Vectors are a special case.  */
+  if (dim == 2)
+    return const1_rtx;		/* Think of this as 1 times 64.  */
+
+  static int offset[] = {
+    /* Offsets into dispatch packet.  */
+    12,				/* X dim = Gang / Team / Work-group.  */
+    20,				/* Z dim = Worker / Thread / Wavefront.  */
+    16				/* Y dim = Vector / SIMD / Work-item.  */
+  };
+  rtx addr
+    = gen_rtx_PLUS (DImode,
+		    gen_rtx_REG (DImode,
+				 cfun->machine->args.reg[DISPATCH_PTR_ARG]),
+		    GEN_INT (offset[dim]));
+  return gen_rtx_MEM (SImode, addr);
+}
+
+/* Helper function for oacc_dim_pos instruction.
+   Also used for OpenMP, via builtin_gcn_dim_pos, and the omp_gcn pass.  */
+
+rtx
+gcn_oacc_dim_pos (int dim)
+{
+  if (dim < 0 || dim > 2)
+    error ("offload dimension out of range (%d)", dim);
+
+  static const int reg[] = {
+    WORKGROUP_ID_X_ARG,		/* Gang / Team / Work-group.  */
+    WORK_ITEM_ID_Z_ARG,		/* Worker / Thread / Wavefront.  */
+    WORK_ITEM_ID_Y_ARG		/* Vector / SIMD / Work-item.  */
+  };
+
+  int reg_num = cfun->machine->args.reg[reg[dim]];
+
+  /* The information must have been requested by the kernel.  */
+  gcc_assert (reg_num >= 0);
+
+  return gen_rtx_REG (SImode, reg_num);
+}
+
+/* Implement TARGET_GOACC_FORK_JOIN.  */
+
+static bool
+gcn_fork_join (gcall *ARG_UNUSED (call), const int *ARG_UNUSED (dims),
+	       bool ARG_UNUSED (is_fork))
+{
+  /* GCN does not use the fork/join concept invented for NVPTX.
+     Instead we use standard autovectorization.  */
+  return false;
+}
+
+/* Implement ???????
+   FIXME make this a real hook.
+
+   Adjust FNDECL such that options inherited from the host compiler
+   are made appropriate for the accelerator compiler.  */
+
+void
+gcn_fixup_accel_lto_options (tree fndecl)
+{
+  tree func_optimize = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (fndecl);
+  if (!func_optimize)
+    return;
+
+  tree old_optimize = build_optimization_node (&global_options);
+  tree new_optimize;
+
+  /* If the function changed the optimization levels as well as
+     setting target options, start with the optimizations
+     specified.  */
+  if (func_optimize != old_optimize)
+    cl_optimization_restore (&global_options,
+			     TREE_OPTIMIZATION (func_optimize));
+
+  gcn_option_override ();
+
+  /* The target attributes may also change some optimization flags,
+     so update the optimization options if necessary.  */
+  new_optimize = build_optimization_node (&global_options);
+
+  if (old_optimize != new_optimize)
+    {
+      DECL_FUNCTION_SPECIFIC_OPTIMIZATION (fndecl) = new_optimize;
+      cl_optimization_restore (&global_options,
+			       TREE_OPTIMIZATION (old_optimize));
+    }
+}
+
+/* }}}  */
+/* {{{ ASM Output.  */
+
+/* Implement TARGET_ASM_FILE_START.
+
+   Print assembler file header text.  */
+
+static void
+output_file_start (void)
+{
+  fprintf (asm_out_file, "\t.text\n");
+  fprintf (asm_out_file, "\t.hsa_code_object_version 2,0\n");
+  fprintf (asm_out_file, "\t.hsa_code_object_isa\n");	/* Autodetect.  */
+  fprintf (asm_out_file, "\t.section\t.AMDGPU.config\n");
+  fprintf (asm_out_file, "\t.text\n");
+}
+
+/* Implement ASM_DECLARE_FUNCTION_NAME via gcn-hsa.h.
+
+   Print the initial definition of a function name.
+
+   For GCN kernel entry points this includes all the HSA meta-data, special
+   alignment constraints that don't apply to regular functions, and magic
+   comments that pass information to mkoffload.  */
+
+void
+gcn_hsa_declare_function_name (FILE *file, const char *name, tree)
+{
+  int sgpr, vgpr;
+  bool xnack_enabled = false;
+  int extra_regs = 0;
+
+  if (cfun && cfun->machine && cfun->machine->normal_function)
+    {
+      fputs ("\t.type\t", file);
+      assemble_name (file, name);
+      fputs (",@function\n", file);
+      assemble_name (file, name);
+      fputs (":\n", file);
+      return;
+    }
+
+  if (!leaf_function_p ())
+    {
+      /* We can't know how many registers function calls might use.  */
+      /* FIXME: restrict normal functions to a smaller set that allows
+         more optimal use of wavefronts.  */
+      vgpr = 256;
+      sgpr = 102;
+      extra_regs = 0;
+    }
+  else
+    {
+      /* Determine count of sgpr/vgpr registers by looking for last
+         one used.  */
+      for (sgpr = 101; sgpr >= 0; sgpr--)
+	if (df_regs_ever_live_p (FIRST_SGPR_REG + sgpr))
+	  break;
+      sgpr++;
+      for (vgpr = 255; vgpr >= 0; vgpr--)
+	if (df_regs_ever_live_p (FIRST_VGPR_REG + vgpr))
+	  break;
+      vgpr++;
+
+      if (xnack_enabled)
+	extra_regs = 6;
+      if (df_regs_ever_live_p (FLAT_SCRATCH_LO_REG)
+	  || df_regs_ever_live_p (FLAT_SCRATCH_HI_REG))
+	extra_regs = 4;
+      else if (df_regs_ever_live_p (VCC_LO_REG)
+	       || df_regs_ever_live_p (VCC_HI_REG))
+	extra_regs = 2;
+    }
+
+  fputs ("\t.align\t256\n", file);
+  fputs ("\t.type\t", file);
+  assemble_name (file, name);
+  fputs (",@function\n\t.amdgpu_hsa_kernel\t", file);
+  assemble_name (file, name);
+  fputs ("\n", file);
+  assemble_name (file, name);
+  fputs (":\n", file);
+  fprintf (file, "\t.amd_kernel_code_t\n"
+	   "\t\tkernel_code_version_major = 1\n"
+	   "\t\tkernel_code_version_minor = 0\n" "\t\tmachine_kind = 1\n"
+	   /* "\t\tmachine_version_major = 8\n"
+	      "\t\tmachine_version_minor = 0\n"
+	      "\t\tmachine_version_stepping = 1\n" */
+	   "\t\tkernel_code_entry_byte_offset = 256\n"
+	   "\t\tkernel_code_prefetch_byte_size = 0\n"
+	   "\t\tmax_scratch_backing_memory_byte_size = 0\n"
+	   "\t\tcompute_pgm_rsrc1_vgprs = %i\n"
+	   "\t\tcompute_pgm_rsrc1_sgprs = %i\n"
+	   "\t\tcompute_pgm_rsrc1_priority = 0\n"
+	   "\t\tcompute_pgm_rsrc1_float_mode = 192\n"
+	   "\t\tcompute_pgm_rsrc1_priv = 0\n"
+	   "\t\tcompute_pgm_rsrc1_dx10_clamp = 1\n"
+	   "\t\tcompute_pgm_rsrc1_debug_mode = 0\n"
+	   "\t\tcompute_pgm_rsrc1_ieee_mode = 1\n"
+	   /* We enable scratch memory.  */
+	   "\t\tcompute_pgm_rsrc2_scratch_en = 1\n"
+	   "\t\tcompute_pgm_rsrc2_user_sgpr = %i\n"
+	   "\t\tcompute_pgm_rsrc2_tgid_x_en = 1\n"
+	   "\t\tcompute_pgm_rsrc2_tgid_y_en = 0\n"
+	   "\t\tcompute_pgm_rsrc2_tgid_z_en = 0\n"
+	   "\t\tcompute_pgm_rsrc2_tg_size_en = 0\n"
+	   "\t\tcompute_pgm_rsrc2_tidig_comp_cnt = 0\n"
+	   "\t\tcompute_pgm_rsrc2_excp_en_msb = 0\n"
+	   "\t\tcompute_pgm_rsrc2_lds_size = 0\n"	/*FIXME */
+	   "\t\tcompute_pgm_rsrc2_excp_en = 0\n",
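+	   /* Register counts are encoded as allocation granules minus one:
+	      VGPRs are allocated in blocks of 4, SGPRs in blocks of 8.  */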
+	   (vgpr - 1) / 4,
+	   /* Must match wavefront_sgpr_count.  */
+	   (sgpr + extra_regs - 1) / 8,
+	   /* The total number of SGPR user data registers requested.  This
+	      number must match the number of user data registers enabled.  */
+	   cfun->machine->args.nsgprs);
+  int reg = FIRST_SGPR_REG;
+  for (int a = 0; a < GCN_KERNEL_ARG_TYPES; a++)
+    {
+      int reg_first = -1;
+      int reg_last;
+      if ((cfun->machine->args.requested & (1 << a))
+	  && (gcn_kernel_arg_types[a].fixed_regno < 0))
+	{
+	  reg_first = reg;
+	  reg_last = (reg_first
+		      + (GET_MODE_SIZE (gcn_kernel_arg_types[a].mode)
+			 / UNITS_PER_WORD) - 1);
+	  reg = reg_last + 1;
+	}
+
+      if (gcn_kernel_arg_types[a].header_pseudo)
+	{
+	  fprintf (file, "\t\t%s = %i",
+		   gcn_kernel_arg_types[a].header_pseudo,
+		   (cfun->machine->args.requested & (1 << a)) != 0);
+	  if (reg_first != -1)
+	    {
+	      fprintf (file, " ; (");
+	      for (int i = reg_first; i <= reg_last; ++i)
+		{
+		  if (i != reg_first)
+		    fprintf (file, ", ");
+		  fprintf (file, "%s", reg_names[i]);
+		}
+	      fprintf (file, ")");
+	    }
+	  fprintf (file, "\n");
+	}
+      else if (gcn_kernel_arg_types[a].fixed_regno >= 0
+	       && cfun->machine->args.requested & (1 << a))
+	fprintf (file, "\t\t; %s = %i (%s)\n",
+		 gcn_kernel_arg_types[a].name,
+		 (cfun->machine->args.requested & (1 << a)) != 0,
+		 reg_names[gcn_kernel_arg_types[a].fixed_regno]);
+    }
+  fprintf (file, "\t\tenable_vgpr_workitem_id = %i\n",
+	   (cfun->machine->args.requested & (1 << WORK_ITEM_ID_Z_ARG))
+	   ? 2
+	   : cfun->machine->args.requested & (1 << WORK_ITEM_ID_Y_ARG)
+	   ? 1 : 0);
+  fprintf (file, "\t\tenable_ordered_append_gds = 0\n"
+	   "\t\tprivate_element_size = 1\n"
+	   "\t\tis_ptr64 = 1\n"
+	   "\t\tis_dynamic_callstack = 0\n"
+	   "\t\tis_debug_enabled = 0\n"
+	   "\t\tis_xnack_enabled = %i\n"
+	   "\t\tworkitem_private_segment_byte_size = %i\n"
+	   "\t\tworkgroup_group_segment_byte_size = %u\n"
+	   "\t\tgds_segment_byte_size = 0\n"
+	   "\t\tkernarg_segment_byte_size = %i\n"
+	   "\t\tworkgroup_fbarrier_count = 0\n"
+	   "\t\twavefront_sgpr_count = %i\n"
+	   "\t\tworkitem_vgpr_count = %i\n"
+	   "\t\treserved_vgpr_first = 0\n"
+	   "\t\treserved_vgpr_count = 0\n"
+	   "\t\treserved_sgpr_first = 0\n"
+	   "\t\treserved_sgpr_count = 0\n"
+	   "\t\tdebug_wavefront_private_segment_offset_sgpr = 0\n"
+	   "\t\tdebug_private_segment_buffer_sgpr = 0\n"
+	   "\t\tkernarg_segment_alignment = %i\n"
+	   "\t\tgroup_segment_alignment = 4\n"
+	   "\t\tprivate_segment_alignment = %i\n"
+	   "\t\twavefront_size = 6\n"
+	   "\t\tcall_convention = 0\n"
+	   "\t\truntime_loader_kernel_symbol = 0\n"
+	   "\t.end_amd_kernel_code_t\n", xnack_enabled,
+	   /* workitem_private_segment_byte_size needs to be
+	      one 64th the wave-front stack size.  */
+	   stack_size_opt / 64,
+	   LDS_SIZE, cfun->machine->kernarg_segment_byte_size,
+	   /* Number of scalar registers used by a wavefront.  This
+	      includes the special SGPRs for VCC, Flat Scratch (Base,
+	      Size) and XNACK (for GFX8 (VI)+).  It does not include the
+	      16 SGPR added if a trap handler is enabled.  Must match
+	      compute_pgm_rsrc1.sgprs.  */
+	   sgpr + extra_regs, vgpr,
+	   cfun->machine->kernarg_segment_alignment,
+	   crtl->stack_alignment_needed / 8);
+
+  /* This comment is read by mkoffload.  */
+  if (flag_openacc)
+    fprintf (file, "\t;; OPENACC-DIMS: %d, %d, %d : %s\n",
+	     oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_GANG),
+	     oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_WORKER),
+	     oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_VECTOR), name);
+}
+
+/* Implement TARGET_ASM_SELECT_SECTION.
+
+   Return the section into which EXP should be placed.  */
+
+static section *
+gcn_asm_select_section (tree exp, int reloc, unsigned HOST_WIDE_INT align)
+{
+  if (TREE_TYPE (exp) != error_mark_node
+      && TYPE_ADDR_SPACE (TREE_TYPE (exp)) == ADDR_SPACE_LDS)
+    {
+      if (!DECL_P (exp))
+	return get_section (".lds_bss",
+			    SECTION_WRITE | SECTION_BSS | SECTION_DEBUG,
+			    NULL);
+
+      return get_named_section (exp, ".lds_bss", reloc);
+    }
+
+  return default_elf_select_section (exp, reloc, align);
+}
+
+/* Implement TARGET_ASM_FUNCTION_PROLOGUE.
+
+   Emit custom text into the assembler file at the head of each function.  */
+
+static void
+gcn_target_asm_function_prologue (FILE *file)
+{
+  machine_function *offsets = gcn_compute_frame_offsets ();
+
+  asm_fprintf (file, "\t; using %s addressing in function\n",
+	       offsets->use_flat_addressing ? "flat" : "global");
+
+  if (offsets->normal_function)
+    {
+      asm_fprintf (file, "\t; frame pointer needed: %s\n",
+		   offsets->need_frame_pointer ? "true" : "false");
+      asm_fprintf (file, "\t; lr needs saving: %s\n",
+		   offsets->lr_needs_saving ? "true" : "false");
+      asm_fprintf (file, "\t; outgoing args size: %wd\n",
+		   offsets->outgoing_args_size);
+      asm_fprintf (file, "\t; pretend size: %wd\n", offsets->pretend_size);
+      asm_fprintf (file, "\t; local vars size: %wd\n", offsets->local_vars);
+      asm_fprintf (file, "\t; callee save size: %wd\n",
+		   offsets->callee_saves);
+    }
+  else
+    {
+      asm_fprintf (file, "\t; HSA kernel entry point\n");
+      asm_fprintf (file, "\t; local vars size: %wd\n", offsets->local_vars);
+      asm_fprintf (file, "\t; outgoing args size: %wd\n",
+		   offsets->outgoing_args_size);
+
+      /* Enable denorms.  */
+      asm_fprintf (file, "\n\t; Set MODE[FP_DENORM]: allow single and double"
+		   " input and output denorms\n");
+      asm_fprintf (file, "\ts_setreg_imm32_b32\thwreg(1, 4, 4), 0xf\n\n");
+    }
+}
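+
+/* As a sketch (values illustrative), a normal function compiled for GCN5
+   would be annotated with comments such as:
+
+	; using global addressing in function
+	; frame pointer needed: false
+	; lr needs saving: true
+	; outgoing args size: 0
+	; pretend size: 0
+	; local vars size: 16
+	; callee save size: 8  */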
+
+/* Helper function for print_operand and print_operand_address.
+
+   Print a register as the assembler requires, according to mode and name.  */
+
+static void
+print_reg (FILE *file, rtx x)
+{
+  machine_mode mode = GET_MODE (x);
+  if (mode == BImode || mode == QImode || mode == HImode || mode == SImode
+      || mode == HFmode || mode == SFmode
+      || mode == V64SFmode || mode == V64SImode
+      || mode == V64QImode || mode == V64HImode)
+    fprintf (file, "%s", reg_names[REGNO (x)]);
+  else if (mode == DImode || mode == V64DImode
+	   || mode == DFmode || mode == V64DFmode)
+    {
+      if (SGPR_REGNO_P (REGNO (x)))
+	fprintf (file, "s[%i:%i]", REGNO (x) - FIRST_SGPR_REG,
+		 REGNO (x) - FIRST_SGPR_REG + 1);
+      else if (VGPR_REGNO_P (REGNO (x)))
+	fprintf (file, "v[%i:%i]", REGNO (x) - FIRST_VGPR_REG,
+		 REGNO (x) - FIRST_VGPR_REG + 1);
+      else if (REGNO (x) == FLAT_SCRATCH_REG)
+	fprintf (file, "flat_scratch");
+      else if (REGNO (x) == EXEC_REG)
+	fprintf (file, "exec");
+      else if (REGNO (x) == VCC_LO_REG)
+	fprintf (file, "vcc");
+      else
+	fprintf (file, "[%s:%s]",
+		 reg_names[REGNO (x)], reg_names[REGNO (x) + 1]);
+    }
+  else if (mode == TImode)
+    {
+      if (SGPR_REGNO_P (REGNO (x)))
+	fprintf (file, "s[%i:%i]", REGNO (x) - FIRST_SGPR_REG,
+		 REGNO (x) - FIRST_SGPR_REG + 3);
+      else if (VGPR_REGNO_P (REGNO (x)))
+	fprintf (file, "v[%i:%i]", REGNO (x) - FIRST_VGPR_REG,
+		 REGNO (x) - FIRST_VGPR_REG + 3);
+      else
+	gcc_unreachable ();
+    }
+  else
+    gcc_unreachable ();
+}
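+
+/* Examples of the syntax produced above: an SImode value in s4 prints as
+   "s4"; a DImode value in the same register prints as the pair "s[4:5]";
+   a TImode value in VGPR v8 prints as "v[8:11]".  */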
+
+/* Implement TARGET_SECTION_TYPE_FLAGS.
+
+   Return a set of section attributes for use by TARGET_ASM_NAMED_SECTION.  */
+
+static unsigned int
+gcn_section_type_flags (tree decl, const char *name, int reloc)
+{
+  if (strcmp (name, ".lds_bss") == 0)
+    return SECTION_WRITE | SECTION_BSS | SECTION_DEBUG;
+
+  return default_section_type_flags (decl, name, reloc);
+}
+
+/* Helper function for gcn_asm_output_symbol_ref.
+
+   FIXME: If we want to have propagation blocks allocated separately and
+   statically like this, it would be better done via symbol refs and the
+   assembler/linker.  This is a temporary hack.  */
+
+static void
+gcn_print_lds_decl (FILE *f, tree var)
+{
+  int *offset;
+  machine_function *machfun = cfun->machine;
+
+  if ((offset = machfun->lds_allocs->get (var)))
+    fprintf (f, "%u", (unsigned) *offset);
+  else
+    {
+      unsigned HOST_WIDE_INT align = DECL_ALIGN_UNIT (var);
+      tree type = TREE_TYPE (var);
+      unsigned HOST_WIDE_INT size = tree_to_uhwi (TYPE_SIZE_UNIT (type));
+      if (size > align && size > 4 && align < 8)
+	align = 8;
+
+      machfun->lds_allocated = ((machfun->lds_allocated + align - 1)
+				& ~(align - 1));
+
+      machfun->lds_allocs->put (var, machfun->lds_allocated);
+      fprintf (f, "%u", machfun->lds_allocated);
+      machfun->lds_allocated += size;
+      if (machfun->lds_allocated > LDS_SIZE)
+	error ("local data-share memory exhausted");
+    }
+}
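+
+/* A sketch of the allocation above: if the first variable seen is a 4-byte
+   int and the second an 8-byte double, the int lands at offset 0 and
+   lds_allocated becomes 4; that is rounded up to the double's 8-byte
+   alignment, so the double lands at offset 8 and lds_allocated ends at 16.  */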
+
+/* Implement ASM_OUTPUT_SYMBOL_REF via gcn-hsa.h.  */
+
+void
+gcn_asm_output_symbol_ref (FILE *file, rtx x)
+{
+  tree decl;
+  if ((decl = SYMBOL_REF_DECL (x)) != 0
+      && TREE_CODE (decl) == VAR_DECL
+      && AS_LDS_P (TYPE_ADDR_SPACE (TREE_TYPE (decl))))
+    {
+      /* LDS symbols (emitted using this hook) are only used at present
+         to propagate worker values from an active thread to neutered
+         threads.  Use the same offset for each such block, but don't
+         use zero because null pointers are used to identify the active
+         thread in GOACC_single_copy_start calls.  */
+      gcn_print_lds_decl (file, decl);
+    }
+  else
+    {
+      assemble_name (file, XSTR (x, 0));
+      /* FIXME: See above -- this condition is unreachable.  */
+      if ((decl = SYMBOL_REF_DECL (x)) != 0
+	  && TREE_CODE (decl) == VAR_DECL
+	  && AS_LDS_P (TYPE_ADDR_SPACE (TREE_TYPE (decl))))
+	fputs ("@abs32", file);
+    }
+}
+
+/* Implement TARGET_CONSTANT_ALIGNMENT.
+
+   Returns the alignment in bits of a constant that is being placed in memory.
+   CONSTANT is the constant and BASIC_ALIGN is the alignment that the object
+   would ordinarily have.  */
+
+static HOST_WIDE_INT
+gcn_constant_alignment (const_tree ARG_UNUSED (constant),
+			HOST_WIDE_INT basic_align)
+{
+  return basic_align > 128 ? basic_align : 128;
+}
+
+/* Implement TARGET_VECTOR_ALIGNMENT.
+
+   The alignment returned by this hook must be a power-of-two multiple of the
+   default alignment of the vector element type.  */
+
+static HOST_WIDE_INT
+gcn_vector_alignment (const_tree type)
+{
+  /* V64BImode is a special case because it gets converted to DImode.  This
+     definition must not trip the asserts within build_truth_vector_type.  */
+  if (TYPE_MODE (type) == V64BImode)
+    return 64;
+
+  HOST_WIDE_INT vec_align = tree_to_shwi (TYPE_SIZE (type));
+  HOST_WIDE_INT elem_align = tree_to_shwi (TYPE_SIZE (TREE_TYPE (type)));
+  HOST_WIDE_INT align = vec_align;
+
+  /* Use the size (natural alignment) of the element type if we have a
+     64-element vector.  At present, smaller vectors will most likely use
+     scalar (load/store) instructions.  This definition will probably need
+     attention if support is added for fewer-element vectors in vector
+     regs.  */
+  if (TYPE_VECTOR_SUBPARTS (type) == 64)
+    align = elem_align;
+
+  return (align > 64) ? 64 : align;
+}
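+
+/* For example, V64SImode (64 lanes of 32 bits; TYPE_SIZE is in bits here)
+   is aligned only to its 32-bit element size, whereas a hypothetical
+   two-element V2SImode vector would take its full 64-bit size as its
+   alignment.  */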
+
+/* Implement PRINT_OPERAND_ADDRESS via gcn.h.  */
+
+void
+print_operand_address (FILE *file, rtx mem)
+{
+  gcc_assert (MEM_P (mem));
+
+  rtx reg;
+  rtx offset;
+  addr_space_t as = MEM_ADDR_SPACE (mem);
+  rtx addr = XEXP (mem, 0);
+  gcc_assert (REG_P (addr) || GET_CODE (addr) == PLUS);
+
+  if (AS_SCRATCH_P (as))
+    switch (GET_CODE (addr))
+      {
+      case REG:
+	print_reg (file, addr);
+	break;
+
+      case PLUS:
+	reg = XEXP (addr, 0);
+	offset = XEXP (addr, 1);
+	print_reg (file, reg);
+	if (GET_CODE (offset) == CONST_INT)
+	  fprintf (file, " offset:" HOST_WIDE_INT_PRINT_DEC, INTVAL (offset));
+	else
+	  abort ();
+	break;
+
+      default:
+	debug_rtx (addr);
+	abort ();
+      }
+  else if (AS_ANY_FLAT_P (as))
+    {
+      if (GET_CODE (addr) == REG)
+	print_reg (file, addr);
+      else
+	{
+	  gcc_assert (TARGET_GCN5_PLUS);
+	  print_reg (file, XEXP (addr, 0));
+	}
+    }
+  else if (AS_GLOBAL_P (as))
+    {
+      gcc_assert (TARGET_GCN5_PLUS);
+
+      rtx base = addr;
+      rtx vgpr_offset = NULL_RTX;
+
+      if (GET_CODE (addr) == PLUS)
+	{
+	  base = XEXP (addr, 0);
+
+	  if (GET_CODE (base) == PLUS)
+	    {
+	      /* (SGPR + VGPR) + CONST  */
+	      vgpr_offset = XEXP (base, 1);
+	      base = XEXP (base, 0);
+	    }
+	  else
+	    {
+	      rtx offset = XEXP (addr, 1);
+
+	      if (REG_P (offset))
+		/* SGPR + VGPR  */
+		vgpr_offset = offset;
+	      else if (CONST_INT_P (offset))
+		/* VGPR + CONST or SGPR + CONST  */
+		;
+	      else
+		output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+	    }
+	}
+
+      if (REG_P (base))
+	{
+	  if (VGPR_REGNO_P (REGNO (base)))
+	    print_reg (file, base);
+	  else if (SGPR_REGNO_P (REGNO (base)))
+	    {
+	      /* The assembler requires a 64-bit VGPR pair here, even though
+	         the offset should be only 32-bit.  */
+	      if (vgpr_offset == NULL_RTX)
+		/* In this case, the vector offset is zero, so we use v0,
+		   which is initialized by the kernel prologue to zero.  */
+		fprintf (file, "v[0:1]");
+	      else if (REG_P (vgpr_offset)
+		       && VGPR_REGNO_P (REGNO (vgpr_offset)))
+		{
+		  fprintf (file, "v[%d:%d]",
+			   REGNO (vgpr_offset) - FIRST_VGPR_REG,
+			   REGNO (vgpr_offset) - FIRST_VGPR_REG + 1);
+		}
+	      else
+		output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+	    }
+	}
+      else
+	output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+    }
+  else if (AS_ANY_DS_P (as))
+    switch (GET_CODE (addr))
+      {
+      case REG:
+	print_reg (file, addr);
+	break;
+
+      case PLUS:
+	reg = XEXP (addr, 0);
+	print_reg (file, reg);
+	break;
+
+      default:
+	debug_rtx (addr);
+	abort ();
+      }
+  else
+    switch (GET_CODE (addr))
+      {
+      case REG:
+	print_reg (file, addr);
+	fprintf (file, ", 0");
+	break;
+
+      case PLUS:
+	reg = XEXP (addr, 0);
+	offset = XEXP (addr, 1);
+	print_reg (file, reg);
+	fprintf (file, ", ");
+	if (GET_CODE (offset) == REG)
+	  print_reg (file, reg);
+	else if (GET_CODE (offset) == CONST_INT)
+	  fprintf (file, HOST_WIDE_INT_PRINT_DEC, INTVAL (offset));
+	else
+	  abort ();
+	break;
+
+      default:
+	debug_rtx (addr);
+	abort ();
+      }
+}
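+
+/* For a global-space MEM with an SGPR base and a VGPR offset, the address
+   printed above is only the VGPR pair; the SGPR base and constant offset
+   are emitted by the 'O' operand code below, so a load might render,
+   illustratively, as:
+
+	global_load_dword v0, v[3:4], s[8:9] offset:16  */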
+
+/* Implement PRINT_OPERAND via gcn.h.
+
+   b - print operand size as untyped operand (b8/b16/b32/b64)
+   B - print operand size as SI/DI untyped operand (b32/b32/b32/b64)
+   e - print "sext(...)" around the operand
+   i - print operand size as signed integer suffix (i8/i16/i32/i64,
+       or f16/f32/f64 for floating-point modes)
+   u - print operand size as unsigned integer suffix (u8/u16/u32/u64)
+   o - print operand size as memory access size for loads
+       (ubyte/ushort/dword/dwordx2/dwordx3/dwordx4)
+   s - print operand size as memory access size for stores
+       (byte/short/dword/dwordx2/dwordx3/dwordx4)
+   C - print conditional code for s_cbranch (_sccz/_sccnz/_vccz/_vccnz...)
+   D - print conditional code for s_cmp (eq_u64/lg_u64...)
+   E - print conditional code for v_cmp (eq_u64/ne_u64...)
+   A - print address in formatting suitable for given address space.
+   O - print offset:n for data share operations.
+   L - print the low register part of a multi-register value.
+   H - print the high register part of a multi-register value.
+   R - print a scalar register number as an integer (temporary hack).
+   V - print a vector register number as an integer (temporary hack).
+   ^ - print "_co" suffix for GCN5 mnemonics
+   g - print "glc", if appropriate for given MEM
+ */
+
+void
+print_operand (FILE *file, rtx x, int code)
+{
+  int xcode = x ? GET_CODE (x) : 0;
+  switch (code)
+    {
+      /* Instructions have the following suffixes.
+         If there are two suffixes, the first is the destination type,
+	 and the second is the source type.
+
+         B32 Bitfield (untyped data) 32-bit
+         B64 Bitfield (untyped data) 64-bit
+         F16 floating-point 16-bit
+         F32 floating-point 32-bit (IEEE 754 single-precision float)
+         F64 floating-point 64-bit (IEEE 754 double-precision float)
+         I16 signed 16-bit integer
+         I32 signed 32-bit integer
+         I64 signed 64-bit integer
+         U16 unsigned 16-bit integer
+         U32 unsigned 32-bit integer
+         U64 unsigned 64-bit integer  */
+
+      /* Print operand size as untyped suffix.  */
+    case 'b':
+      {
+	const char *s = "";
+	machine_mode mode = GET_MODE (x);
+	if (VECTOR_MODE_P (mode))
+	  mode = GET_MODE_INNER (mode);
+	switch (GET_MODE_SIZE (mode))
+	  {
+	  case 1:
+	    s = "_b8";
+	    break;
+	  case 2:
+	    s = "_b16";
+	    break;
+	  case 4:
+	    s = "_b32";
+	    break;
+	  case 8:
+	    s = "_b64";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid operand %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+      }
+      return;
+    case 'B':
+      {
+	const char *s = "";
+	machine_mode mode = GET_MODE (x);
+	if (VECTOR_MODE_P (mode))
+	  mode = GET_MODE_INNER (mode);
+	switch (GET_MODE_SIZE (mode))
+	  {
+	  case 1:
+	  case 2:
+	  case 4:
+	    s = "_b32";
+	    break;
+	  case 8:
+	    s = "_b64";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid operand %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+      }
+      return;
+    case 'e':
+      fputs ("sext(", file);
+      print_operand (file, x, 0);
+      fputs (")", file);
+      return;
+    case 'i':
+    case 'u':
+      {
+	bool signed_p = code == 'i';
+	const char *s = "";
+	machine_mode mode = GET_MODE (x);
+	if (VECTOR_MODE_P (mode))
+	  mode = GET_MODE_INNER (mode);
+	if (mode == VOIDmode)
+	  switch (GET_CODE (x))
+	    {
+	    case CONST_INT:
+	      s = signed_p ? "_i32" : "_u32";
+	      break;
+	    case CONST_DOUBLE:
+	      s = "_f64";
+	      break;
+	    default:
+	      output_operand_lossage ("invalid operand %%xn code");
+	      return;
+	    }
+	else if (FLOAT_MODE_P (mode))
+	  switch (GET_MODE_SIZE (mode))
+	    {
+	    case 2:
+	      s = "_f16";
+	      break;
+	    case 4:
+	      s = "_f32";
+	      break;
+	    case 8:
+	      s = "_f64";
+	      break;
+	    default:
+	      output_operand_lossage ("invalid operand %%xn code");
+	      return;
+	    }
+	else
+	  switch (GET_MODE_SIZE (mode))
+	    {
+	    case 1:
+	      s = signed_p ? "_i8" : "_u8";
+	      break;
+	    case 2:
+	      s = signed_p ? "_i16" : "_u16";
+	      break;
+	    case 4:
+	      s = signed_p ? "_i32" : "_u32";
+	      break;
+	    case 8:
+	      s = signed_p ? "_i64" : "_u64";
+	      break;
+	    default:
+	      output_operand_lossage ("invalid operand %%xn code");
+	      return;
+	    }
+	fputs (s, file);
+      }
+      return;
+      /* Print operand size as untyped suffix.  */
+    case 'o':
+      {
+	const char *s = 0;
+	switch (GET_MODE_SIZE (GET_MODE (x)))
+	  {
+	  case 1:
+	    s = "_ubyte";
+	    break;
+	  case 2:
+	    s = "_ushort";
+	    break;
+	  /* The following are full-vector variants.  */
+	  case 64:
+	    s = "_ubyte";
+	    break;
+	  case 128:
+	    s = "_ushort";
+	    break;
+	  }
+
+	if (s)
+	  {
+	    fputs (s, file);
+	    return;
+	  }
+
+	/* Fall-through - the other cases for 'o' are the same as for 's'.  */
+      }
+    case 's':
+      {
+	const char *s = "";
+	switch (GET_MODE_SIZE (GET_MODE (x)))
+	  {
+	  case 1:
+	    s = "_byte";
+	    break;
+	  case 2:
+	    s = "_short";
+	    break;
+	  case 4:
+	    s = "_dword";
+	    break;
+	  case 8:
+	    s = "_dwordx2";
+	    break;
+	  case 12:
+	    s = "_dwordx3";
+	    break;
+	  case 16:
+	    s = "_dwordx4";
+	    break;
+	  case 32:
+	    s = "_dwordx8";
+	    break;
+	  case 64:
+	    s = VECTOR_MODE_P (GET_MODE (x)) ? "_byte" : "_dwordx16";
+	    break;
+	  /* The following are full-vector variants.  */
+	  case 128:
+	    s = "_short";
+	    break;
+	  case 256:
+	    s = "_dword";
+	    break;
+	  case 512:
+	    s = "_dwordx2";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid operand %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+      }
+      return;
+    case 'A':
+      if (xcode != MEM)
+	{
+	  output_operand_lossage ("invalid %%xn code");
+	  return;
+	}
+      print_operand_address (file, x);
+      return;
+    case 'O':
+      {
+	if (xcode != MEM)
+	  {
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	if (AS_GDS_P (MEM_ADDR_SPACE (x)))
+	  fprintf (file, " gds");
+
+	rtx x0 = XEXP (x, 0);
+	if (AS_GLOBAL_P (MEM_ADDR_SPACE (x)))
+	  {
+	    gcc_assert (TARGET_GCN5_PLUS);
+
+	    fprintf (file, ", ");
+
+	    rtx base = x0;
+	    rtx const_offset = NULL_RTX;
+
+	    if (GET_CODE (base) == PLUS)
+	      {
+		rtx offset = XEXP (x0, 1);
+		base = XEXP (x0, 0);
+
+		if (GET_CODE (base) == PLUS)
+		  /* (SGPR + VGPR) + CONST  */
+		  /* Ignore the VGPR offset for this operand.  */
+		  base = XEXP (base, 0);
+
+		if (CONST_INT_P (offset))
+		  const_offset = XEXP (x0, 1);
+		else if (REG_P (offset))
+		  /* SGPR + VGPR  */
+		  /* Ignore the VGPR offset for this operand.  */
+		  ;
+		else
+		  output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+	      }
+
+	    if (REG_P (base))
+	      {
+		if (VGPR_REGNO_P (REGNO (base)))
+		  /* The VGPR address is specified in the %A operand.  */
+		  fprintf (file, "off");
+		else if (SGPR_REGNO_P (REGNO (base)))
+		  print_reg (file, base);
+		else
+		  output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+	      }
+	    else
+	      output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
+
+	    if (const_offset != NULL_RTX)
+	      fprintf (file, " offset:" HOST_WIDE_INT_PRINT_DEC,
+		       INTVAL (const_offset));
+
+	    return;
+	  }
+
+	if (GET_CODE (x0) == REG)
+	  return;
+	if (GET_CODE (x0) != PLUS)
+	  {
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	rtx val = XEXP (x0, 1);
+	if (GET_CODE (val) == CONST_VECTOR)
+	  val = CONST_VECTOR_ELT (val, 0);
+	if (GET_CODE (val) != CONST_INT)
+	  {
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	fprintf (file, " offset:" HOST_WIDE_INT_PRINT_DEC, INTVAL (val));
+
+      }
+      return;
+    case 'C':
+      {
+	const char *s;
+	bool num = false;
+	if ((xcode != EQ && xcode != NE) || !REG_P (XEXP (x, 0)))
+	  {
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	switch (REGNO (XEXP (x, 0)))
+	  {
+	  case VCCZ_REG:
+	    s = "_vcc";
+	    break;
+	  case SCC_REG:
+	    /* For some reason llvm-mc insists on scc0 instead of sccz.  */
+	    num = true;
+	    s = "_scc";
+	    break;
+	  case EXECZ_REG:
+	    s = "_exec";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+	if (xcode == EQ)
+	  fputc (num ? '0' : 'z', file);
+	else
+	  fputs (num ? "1" : "nz", file);
+	return;
+      }
+    case 'D':
+      {
+	const char *s;
+	bool cmp_signed = false;
+	switch (xcode)
+	  {
+	  case EQ:
+	    s = "_eq_";
+	    break;
+	  case NE:
+	    s = "_lg_";
+	    break;
+	  case LT:
+	    s = "_lt_";
+	    cmp_signed = true;
+	    break;
+	  case LE:
+	    s = "_le_";
+	    cmp_signed = true;
+	    break;
+	  case GT:
+	    s = "_gt_";
+	    cmp_signed = true;
+	    break;
+	  case GE:
+	    s = "_ge_";
+	    cmp_signed = true;
+	    break;
+	  case LTU:
+	    s = "_lt_";
+	    break;
+	  case LEU:
+	    s = "_le_";
+	    break;
+	  case GTU:
+	    s = "_gt_";
+	    break;
+	  case GEU:
+	    s = "_ge_";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+	fputc (cmp_signed ? 'i' : 'u', file);
+
+	machine_mode mode = GET_MODE (XEXP (x, 0));
+
+	if (mode == VOIDmode)
+	  mode = GET_MODE (XEXP (x, 1));
+
+	/* If both sides are constants, then assume the instruction is in
+	   SImode since s_cmp can only do integer compares.  */
+	if (mode == VOIDmode)
+	  mode = SImode;
+
+	switch (GET_MODE_SIZE (mode))
+	  {
+	  case 4:
+	    s = "32";
+	    break;
+	  case 8:
+	    s = "64";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid operand %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+	return;
+      }
+    case 'E':
+      {
+	const char *s;
+	bool cmp_signed = false;
+	machine_mode mode = GET_MODE (XEXP (x, 0));
+
+	if (mode == VOIDmode)
+	  mode = GET_MODE (XEXP (x, 1));
+
+	/* If both sides are constants, assume the instruction is in SFmode
+	   if either operand is floating point, otherwise assume SImode.  */
+	if (mode == VOIDmode)
+	  {
+	    if (GET_CODE (XEXP (x, 0)) == CONST_DOUBLE
+		|| GET_CODE (XEXP (x, 1)) == CONST_DOUBLE)
+	      mode = SFmode;
+	    else
+	      mode = SImode;
+	  }
+
+	/* Use the same format code for vector comparisons.  */
+	if (GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT
+	    || GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+	  mode = GET_MODE_INNER (mode);
+
+	bool float_p = GET_MODE_CLASS (mode) == MODE_FLOAT;
+
+	switch (xcode)
+	  {
+	  case EQ:
+	    s = "_eq_";
+	    break;
+	  case NE:
+	    s = float_p ? "_neq_" : "_ne_";
+	    break;
+	  case LT:
+	    s = "_lt_";
+	    cmp_signed = true;
+	    break;
+	  case LE:
+	    s = "_le_";
+	    cmp_signed = true;
+	    break;
+	  case GT:
+	    s = "_gt_";
+	    cmp_signed = true;
+	    break;
+	  case GE:
+	    s = "_ge_";
+	    cmp_signed = true;
+	    break;
+	  case LTU:
+	    s = "_lt_";
+	    break;
+	  case LEU:
+	    s = "_le_";
+	    break;
+	  case GTU:
+	    s = "_gt_";
+	    break;
+	  case GEU:
+	    s = "_ge_";
+	    break;
+	  case ORDERED:
+	    s = "_o_";
+	    break;
+	  case UNORDERED:
+	    s = "_u_";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+	fputc (float_p ? 'f' : cmp_signed ? 'i' : 'u', file);
+
+	switch (GET_MODE_SIZE (mode))
+	  {
+	  case 1:
+	    s = "32";
+	    break;
+	  case 2:
+	    s = float_p ? "16" : "32";
+	    break;
+	  case 4:
+	    s = "32";
+	    break;
+	  case 8:
+	    s = "64";
+	    break;
+	  default:
+	    output_operand_lossage ("invalid operand %%xn code");
+	    return;
+	  }
+	fputs (s, file);
+	return;
+      }
+    case 'L':
+      print_operand (file, gcn_operand_part (GET_MODE (x), x, 0), 0);
+      return;
+    case 'H':
+      print_operand (file, gcn_operand_part (GET_MODE (x), x, 1), 0);
+      return;
+    case 'R':
+      /* Print a scalar register number as an integer.  Temporary hack.  */
+      gcc_assert (REG_P (x));
+      fprintf (file, "%u", (int) REGNO (x));
+      return;
+    case 'V':
+      /* Print a vector register number as an integer.  Temporary hack.  */
+      gcc_assert (REG_P (x));
+      fprintf (file, "%u", (int) REGNO (x) - FIRST_VGPR_REG);
+      return;
+    case 0:
+      if (xcode == REG)
+	print_reg (file, x);
+      else if (xcode == MEM)
+	output_address (GET_MODE (x), x);
+      else if (xcode == CONST_INT)
+	fprintf (file, "%i", (int) INTVAL (x));
+      else if (xcode == CONST_VECTOR)
+	print_operand (file, CONST_VECTOR_ELT (x, 0), code);
+      else if (xcode == CONST_DOUBLE)
+	{
+	  const char *str;
+	  switch (gcn_inline_fp_constant_p (x, false))
+	    {
+	    case 240:
+	      str = "0.5";
+	      break;
+	    case 241:
+	      str = "-0.5";
+	      break;
+	    case 242:
+	      str = "1.0";
+	      break;
+	    case 243:
+	      str = "-1.0";
+	      break;
+	    case 244:
+	      str = "2.0";
+	      break;
+	    case 245:
+	      str = "-2.0";
+	      break;
+	    case 246:
+	      str = "4.0";
+	      break;
+	    case 247:
+	      str = "-4.0";
+	      break;
+	    case 248:
+	      str = "1/pi";
+	      break;
+	    default:
+	      rtx ix = simplify_gen_subreg (GET_MODE (x) == DFmode
+					    ? DImode : SImode,
+					    x, GET_MODE (x), 0);
+	      if (ix)
+		print_operand (file, ix, code);
+	      else
+		output_operand_lossage ("invalid fp constant");
+	      return;
+	    }
+	  fputs (str, file);
+	  return;
+	}
+      else
+	output_addr_const (file, x);
+      return;
+    case '^':
+      if (TARGET_GCN5_PLUS)
+	fputs ("_co", file);
+      return;
+    case 'g':
+      gcc_assert (xcode == MEM);
+      if (MEM_VOLATILE_P (x))
+	fputs (" glc", file);
+      return;
+    default:
+      output_operand_lossage ("invalid %%xn code");
+    }
+  gcc_unreachable ();
+}
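+
+/* A hypothetical insn template combining several of the codes above:
+   "flat_load%o0\t%0, %A1%O1%g1" prints the access size for operand 0, the
+   address of the source MEM, any offset, and " glc" if the MEM is
+   volatile.  */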
+
+/* Return a hash value calculated from NAME.  Used by
+   ASM_FORMAT_PRIVATE_NAME.  */
+
+unsigned int
+gcn_local_sym_hash (const char *name)
+{
+  unsigned int val = 0;
+
+  if (!name)
+    return 0;
+
+  for (int i = 0; name[i]; i++)
+    val = val * 223 + name[i];
+
+  return val;
+}
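+
+/* This is a simple polynomial hash with multiplier 223; for example,
+   gcn_local_sym_hash ("ab") yields ('a' * 223) + 'b' == 21729.  */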
+
+static tree
+gcn_mangle_decl_assembler_name (tree decl, tree id)
+{
+  if (TREE_CODE (decl) == VAR_DECL
+      && TREE_STATIC (decl)
+      && !TREE_PUBLIC (decl)
+      && local_symbol_id
+      && *local_symbol_id)
+    {
+      const char *name = IDENTIFIER_POINTER (id);
+      char *newname = (char *) alloca (strlen (name) + 16);
+
+      sprintf (newname, "%s.%.8x", name, gcn_local_sym_hash (local_symbol_id));
+
+      return get_identifier (newname);
+    }
+  else
+    return default_mangle_decl_assembler_name (decl, id);
+}
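+
+/* For example, with a local symbol id in effect, a function-local static
+   "counter" would be renamed to "counter.xxxxxxxx", where the suffix is
+   the 8-hex-digit hash of the symbol id (value here purely illustrative).  */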
+
+/* }}}  */
+/* {{{ TARGET hook overrides.  */
+
+#undef  TARGET_ADDR_SPACE_ADDRESS_MODE
+#define TARGET_ADDR_SPACE_ADDRESS_MODE gcn_addr_space_address_mode
+#undef  TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P
+#define TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P \
+  gcn_addr_space_legitimate_address_p
+#undef  TARGET_ADDR_SPACE_LEGITIMIZE_ADDRESS
+#define TARGET_ADDR_SPACE_LEGITIMIZE_ADDRESS gcn_addr_space_legitimize_address
+#undef  TARGET_ADDR_SPACE_POINTER_MODE
+#define TARGET_ADDR_SPACE_POINTER_MODE gcn_addr_space_pointer_mode
+#undef  TARGET_ADDR_SPACE_SUBSET_P
+#define TARGET_ADDR_SPACE_SUBSET_P gcn_addr_space_subset_p
+#undef  TARGET_ADDR_SPACE_CONVERT
+#define TARGET_ADDR_SPACE_CONVERT gcn_addr_space_convert
+#undef  TARGET_ARG_PARTIAL_BYTES
+#define TARGET_ARG_PARTIAL_BYTES gcn_arg_partial_bytes
+#undef  TARGET_ASM_ALIGNED_DI_OP
+#define TARGET_ASM_ALIGNED_DI_OP "\t.8byte\t"
+#undef  TARGET_ASM_CONSTRUCTOR
+#define TARGET_ASM_CONSTRUCTOR gcn_disable_constructors
+#undef  TARGET_ASM_DESTRUCTOR
+#define TARGET_ASM_DESTRUCTOR gcn_disable_constructors
+#undef  TARGET_ASM_FILE_START
+#define TARGET_ASM_FILE_START output_file_start
+#undef  TARGET_ASM_FUNCTION_PROLOGUE
+#define TARGET_ASM_FUNCTION_PROLOGUE gcn_target_asm_function_prologue
+#undef  TARGET_ASM_SELECT_SECTION
+#define TARGET_ASM_SELECT_SECTION gcn_asm_select_section
+#undef  TARGET_ASM_TRAMPOLINE_TEMPLATE
+#define TARGET_ASM_TRAMPOLINE_TEMPLATE gcn_asm_trampoline_template
+#undef  TARGET_ATTRIBUTE_TABLE
+#define TARGET_ATTRIBUTE_TABLE gcn_attribute_table
+#undef  TARGET_BUILTIN_DECL
+#define TARGET_BUILTIN_DECL gcn_builtin_decl
+#undef  TARGET_CAN_CHANGE_MODE_CLASS
+#define TARGET_CAN_CHANGE_MODE_CLASS gcn_can_change_mode_class
+#undef  TARGET_CAN_ELIMINATE
+#define TARGET_CAN_ELIMINATE gcn_can_eliminate_p
+#undef  TARGET_CANNOT_COPY_INSN_P
+#define TARGET_CANNOT_COPY_INSN_P gcn_cannot_copy_insn_p
+#undef  TARGET_CLASS_LIKELY_SPILLED_P
+#define TARGET_CLASS_LIKELY_SPILLED_P gcn_class_likely_spilled_p
+#undef  TARGET_CLASS_MAX_NREGS
+#define TARGET_CLASS_MAX_NREGS gcn_class_max_nregs
+#undef  TARGET_CONDITIONAL_REGISTER_USAGE
+#define TARGET_CONDITIONAL_REGISTER_USAGE gcn_conditional_register_usage
+#undef  TARGET_CONSTANT_ALIGNMENT
+#define TARGET_CONSTANT_ALIGNMENT gcn_constant_alignment
+#undef  TARGET_DEBUG_UNWIND_INFO
+#define TARGET_DEBUG_UNWIND_INFO gcn_debug_unwind_info
+#undef  TARGET_EXPAND_BUILTIN
+#define TARGET_EXPAND_BUILTIN gcn_expand_builtin
+#undef  TARGET_FUNCTION_ARG
+#define TARGET_FUNCTION_ARG gcn_function_arg
+#undef  TARGET_FUNCTION_ARG_ADVANCE
+#define TARGET_FUNCTION_ARG_ADVANCE gcn_function_arg_advance
+#undef  TARGET_FUNCTION_VALUE
+#define TARGET_FUNCTION_VALUE gcn_function_value
+#undef  TARGET_FUNCTION_VALUE_REGNO_P
+#define TARGET_FUNCTION_VALUE_REGNO_P gcn_function_value_regno_p
+#undef  TARGET_GIMPLIFY_VA_ARG_EXPR
+#define TARGET_GIMPLIFY_VA_ARG_EXPR gcn_gimplify_va_arg_expr
+#undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
+#define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
+  gcn_goacc_adjust_propagation_record
+#undef  TARGET_GOACC_ADJUST_GANGPRIVATE_DECL
+#define TARGET_GOACC_ADJUST_GANGPRIVATE_DECL gcn_goacc_adjust_gangprivate_decl
+#undef  TARGET_GOACC_FORK_JOIN
+#define TARGET_GOACC_FORK_JOIN gcn_fork_join
+#undef  TARGET_GOACC_REDUCTION
+#define TARGET_GOACC_REDUCTION gcn_goacc_reduction
+#undef  TARGET_GOACC_VALIDATE_DIMS
+#define TARGET_GOACC_VALIDATE_DIMS gcn_goacc_validate_dims
+#undef  TARGET_GOACC_WORKER_PARTITIONING
+#define TARGET_GOACC_WORKER_PARTITIONING true
+#undef  TARGET_HARD_REGNO_MODE_OK
+#define TARGET_HARD_REGNO_MODE_OK gcn_hard_regno_mode_ok
+#undef  TARGET_HARD_REGNO_NREGS
+#define TARGET_HARD_REGNO_NREGS gcn_hard_regno_nregs
+#undef  TARGET_INIT_BUILTINS
+#define TARGET_INIT_BUILTINS gcn_init_builtins
+#undef  TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
+#define TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS \
+  gcn_ira_change_pseudo_allocno_class
+#undef  TARGET_LEGITIMATE_COMBINED_INSN
+#define TARGET_LEGITIMATE_COMBINED_INSN gcn_legitimate_combined_insn
+#undef  TARGET_LEGITIMATE_CONSTANT_P
+#define TARGET_LEGITIMATE_CONSTANT_P gcn_legitimate_constant_p
+#undef  TARGET_LRA_P
+#define TARGET_LRA_P hook_bool_void_true
+#undef  TARGET_MACHINE_DEPENDENT_REORG
+#define TARGET_MACHINE_DEPENDENT_REORG gcn_md_reorg
+#undef  TARGET_MANGLE_DECL_ASSEMBLER_NAME
+#define TARGET_MANGLE_DECL_ASSEMBLER_NAME gcn_mangle_decl_assembler_name
+#undef  TARGET_MEMORY_MOVE_COST
+#define TARGET_MEMORY_MOVE_COST gcn_memory_move_cost
+#undef  TARGET_MODES_TIEABLE_P
+#define TARGET_MODES_TIEABLE_P gcn_modes_tieable_p
+#undef  TARGET_OPTION_OVERRIDE
+#define TARGET_OPTION_OVERRIDE gcn_option_override
+#undef  TARGET_PRETEND_OUTGOING_VARARGS_NAMED
+#define TARGET_PRETEND_OUTGOING_VARARGS_NAMED \
+  gcn_pretend_outgoing_varargs_named
+#undef  TARGET_PROMOTE_FUNCTION_MODE
+#define TARGET_PROMOTE_FUNCTION_MODE gcn_promote_function_mode
+#undef  TARGET_REGISTER_MOVE_COST
+#define TARGET_REGISTER_MOVE_COST gcn_register_move_cost
+#undef  TARGET_RETURN_IN_MEMORY
+#define TARGET_RETURN_IN_MEMORY gcn_return_in_memory
+#undef  TARGET_RTX_COSTS
+#define TARGET_RTX_COSTS gcn_rtx_costs
+#undef  TARGET_SECONDARY_RELOAD
+#define TARGET_SECONDARY_RELOAD gcn_secondary_reload
+#undef  TARGET_SECTION_TYPE_FLAGS
+#define TARGET_SECTION_TYPE_FLAGS gcn_section_type_flags
+#undef  TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P
+#define TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P \
+  gcn_small_register_classes_for_mode_p
+#undef  TARGET_SPILL_CLASS
+#define TARGET_SPILL_CLASS gcn_spill_class
+#undef  TARGET_STRICT_ARGUMENT_NAMING
+#define TARGET_STRICT_ARGUMENT_NAMING gcn_strict_argument_naming
+#undef  TARGET_TRAMPOLINE_INIT
+#define TARGET_TRAMPOLINE_INIT gcn_trampoline_init
+#undef  TARGET_TRULY_NOOP_TRUNCATION
+#define TARGET_TRULY_NOOP_TRUNCATION gcn_truly_noop_truncation
+#undef  TARGET_VECTOR_ALIGNMENT
+#define TARGET_VECTOR_ALIGNMENT gcn_vector_alignment
+#undef  TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
+#define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST gcn_vectorization_cost
+#undef  TARGET_VECTORIZE_GET_MASK_MODE
+#define TARGET_VECTORIZE_GET_MASK_MODE gcn_vectorize_get_mask_mode
+#undef  TARGET_VECTORIZE_PREFERRED_SIMD_MODE
+#define TARGET_VECTORIZE_PREFERRED_SIMD_MODE gcn_vectorize_preferred_simd_mode
+#undef  TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
+#define TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT \
+  gcn_vectorize_support_vector_misalignment
+#undef  TARGET_VECTORIZE_VEC_PERM_CONST
+#define TARGET_VECTORIZE_VEC_PERM_CONST gcn_vectorize_vec_perm_const
+#undef  TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE
+#define TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE \
+  gcn_vector_alignment_reachable
+#undef  TARGET_VECTOR_MODE_SUPPORTED_P
+#define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
+
+struct gcc_target targetm = TARGET_INITIALIZER;
+
+#include "gt-gcn.h"
+/* }}}  */
diff --git a/gcc/config/gcn/gcn.h b/gcc/config/gcn/gcn.h
new file mode 100644
index 0000000..74f0773
--- /dev/null
+++ b/gcc/config/gcn/gcn.h
@@ -0,0 +1,670 @@
+/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your option)
+   any later version.
+
+   This file is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config/gcn/gcn-opts.h"
+
+#define TARGET_CPU_CPP_BUILTINS()	\
+  do					\
+    {					\
+      builtin_define ("__AMDGCN__");	\
+    }					\
+  while(0)
+
+/* Support for a compile-time default architecture and tuning.
+   The rules are:
+   --with-arch is ignored if -march is specified.
+   --with-tune is ignored if -mtune is specified.  */
+#define OPTION_DEFAULT_SPECS		    \
+  {"arch", "%{!march=*:-march=%(VALUE)}" }, \
+  {"tune", "%{!mtune=*:-mtune=%(VALUE)}" }
+
+/* Default target_flags if no switches specified.  */
+#ifndef TARGET_DEFAULT
+#define TARGET_DEFAULT 0
+#endif
+
+\f
+/* Storage Layout */
+#define BITS_BIG_ENDIAN  0
+#define BYTES_BIG_ENDIAN 0
+#define WORDS_BIG_ENDIAN 0
+
+#define BITS_PER_WORD 32
+#define UNITS_PER_WORD (BITS_PER_WORD/BITS_PER_UNIT)
+#define LIBGCC2_UNITS_PER_WORD 4
+
+#define POINTER_SIZE	     64
+#define PARM_BOUNDARY	     64
+#define STACK_BOUNDARY	     64
+#define FUNCTION_BOUNDARY    32
+#define BIGGEST_ALIGNMENT    64
+#define EMPTY_FIELD_BOUNDARY 32
+#define MAX_FIXED_MODE_SIZE  64
+#define MAX_REGS_PER_ADDRESS 2
+#define STACK_SIZE_MODE      DImode
+#define Pmode		     DImode
+#define CASE_VECTOR_MODE     DImode
+#define FUNCTION_MODE	     QImode
+
+#define DATA_ALIGNMENT(TYPE,ALIGN) ((ALIGN) > 128 ? (ALIGN) : 128)
+#define LOCAL_ALIGNMENT(TYPE,ALIGN) ((ALIGN) > 64 ? (ALIGN) : 64)
+#define STACK_SLOT_ALIGNMENT(TYPE,MODE,ALIGN) ((ALIGN) > 64 ? (ALIGN) : 64)
+#define STRICT_ALIGNMENT 1
+
+/* Type Layout: match what x86_64 does.  */
+#define INT_TYPE_SIZE		  32
+#define LONG_TYPE_SIZE		  64
+#define LONG_LONG_TYPE_SIZE	  64
+#define FLOAT_TYPE_SIZE		  32
+#define DOUBLE_TYPE_SIZE	  64
+#define LONG_DOUBLE_TYPE_SIZE	  64
+#define DEFAULT_SIGNED_CHAR	  1
+#define PCC_BITFIELD_TYPE_MATTERS 1
+
+/* Frame Layout */
+#define FRAME_GROWS_DOWNWARD	     0
+#define ARGS_GROW_DOWNWARD	     1
+#define STACK_POINTER_OFFSET	     0
+#define FIRST_PARM_OFFSET(FNDECL)    0
+#define DYNAMIC_CHAIN_ADDRESS(FP)    plus_constant (Pmode, (FP), -16)
+#define INCOMING_RETURN_ADDR_RTX     gen_rtx_REG (Pmode, LINK_REGNUM)
+#define STACK_DYNAMIC_OFFSET(FNDECL) (-crtl->outgoing_args_size)
+#define ACCUMULATE_OUTGOING_ARGS     1
+#define RETURN_ADDR_RTX(COUNT,FRAMEADDR) \
+  ((COUNT) == 0 ? get_hard_reg_initial_val (Pmode, LINK_REGNUM) : NULL_RTX)
+\f
+/* Register Basics */
+#define FIRST_SGPR_REG	    0
+#define SGPR_REGNO(N)	    ((N)+FIRST_SGPR_REG)
+#define LAST_SGPR_REG	    101
+
+#define FLAT_SCRATCH_REG    102
+#define FLAT_SCRATCH_LO_REG 102
+#define FLAT_SCRATCH_HI_REG 103
+#define XNACK_MASK_REG	    104
+#define XNACK_MASK_LO_REG   104
+#define XNACK_MASK_HI_REG   105
+#define VCC_LO_REG	    106
+#define VCC_HI_REG	    107
+#define VCCZ_REG	    108
+#define TBA_REG		    109
+#define TBA_LO_REG	    109
+#define TBA_HI_REG	    110
+#define TMA_REG		    111
+#define TMA_LO_REG	    111
+#define TMA_HI_REG	    112
+#define TTMP0_REG	    113
+#define TTMP11_REG	    124
+#define M0_REG		    125
+#define EXEC_REG	    126
+#define EXEC_LO_REG	    126
+#define EXEC_HI_REG	    127
+#define EXECZ_REG	    128
+#define SCC_REG		    129
+/* 130-159 are reserved to simplify masks.  */
+#define FIRST_VGPR_REG	    160
+#define VGPR_REGNO(N)	    ((N)+FIRST_VGPR_REG)
+#define LAST_VGPR_REG	    415
+
+/* Frame Registers, and other registers */
+
+#define HARD_FRAME_POINTER_REGNUM 14
+#define STACK_POINTER_REGNUM	  16
+#define LINK_REGNUM		  18
+#define EXEC_SAVE_REG		  20
+#define CC_SAVE_REG		  22
+#define RETURN_VALUE_REG	  24	/* Must be divisible by 4.  */
+#define STATIC_CHAIN_REGNUM	  30
+#define WORK_ITEM_ID_Z_REG	  162
+#define SOFT_ARG_REG		  416
+#define FRAME_POINTER_REGNUM	  418
+#define FIRST_PSEUDO_REGISTER	  420
+
+#define FIRST_PARM_REG 24
+#define NUM_PARM_REGS  6
+
+/* There is no arg pointer.  Just choose an arbitrary fixed register that
+   does not interfere with anything.  */
+#define ARG_POINTER_REGNUM SOFT_ARG_REG
+
+#define HARD_FRAME_POINTER_IS_ARG_POINTER   0
+#define HARD_FRAME_POINTER_IS_FRAME_POINTER 0
+
+#define SGPR_OR_VGPR_REGNO_P(N) (SGPR_REGNO_P (N) || VGPR_REGNO_P (N))
+#define SGPR_REGNO_P(N)		((N) <= LAST_SGPR_REG)
+#define VGPR_REGNO_P(N)		((N)>=FIRST_VGPR_REG && (N) <= LAST_VGPR_REG)
+#define SSRC_REGNO_P(N)		((N) <= SCC_REG && (N) != VCCZ_REG)
+#define SDST_REGNO_P(N)		((N) <= EXEC_HI_REG && (N) != VCCZ_REG)
+#define CC_REG_P(X)		(REG_P (X) && CC_REGNO_P (REGNO (X)))
+#define CC_REGNO_P(X)		((X) == SCC_REG || (X) == VCC_REG)
+#define FUNCTION_ARG_REGNO_P(N) \
+  ((N) >= FIRST_PARM_REG && (N) < (FIRST_PARM_REG + NUM_PARM_REGS))
+
+\f
+#define FIXED_REGISTERS {			    \
+    /* Scalars.  */				    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+/*		fp    sp    lr.  */		    \
+    0, 0, 0, 0, 1, 1, 1, 1, 0, 0,		    \
+/*  exec_save, cc_save */			    \
+    1, 1, 1, 1, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,		    \
+    0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,		    \
+    /* Special regs and padding.  */		    \
+/*  flat  xnack vcc	 tba   tma   ttmp */	    \
+    1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+/*			 m0 exec     scc */	    \
+    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,		    \
+    /* VGPRs */					    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    /* Other registers.  */			    \
+    1, 1, 1, 1					    \
+}
+
+#define CALL_USED_REGISTERS {			    \
+    /* Scalars.  */				    \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 		    \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 		    \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 		    \
+    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 		    \
+    0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,		    \
+    /* Special regs and padding.  */		    \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,		    \
+    /* VGPRs */					    \
+    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+    /* Other registers.  */			    \
+    1, 1, 1, 1					    \
+}
+
+\f
+/* This returns true if the register has a special purpose on the
+   architecture, but is not fixed.  */
+#define SPECIAL_REGNO_P(REGNO)                                          \
+  ((REGNO) == SCC_REG || (REGNO) == VCC_LO_REG || (REGNO) == VCC_HI_REG \
+   || (REGNO) == EXEC_LO_REG || (REGNO) == EXEC_HI_REG)
+
+#define HARD_REGNO_RENAME_OK(FROM, TO) \
+  gcn_hard_regno_rename_ok (FROM, TO)
+
+#define HARD_REGNO_CALLER_SAVE_MODE(HARDREG, NREGS, MODE) \
+  gcn_hard_regno_caller_save_mode ((HARDREG), (NREGS), (MODE))
+
+/* Register Classes */
+
+enum reg_class
+{
+  NO_REGS,
+
+  /* SCC */
+  SCC_CONDITIONAL_REG,
+
+  /* VCCZ */
+  VCCZ_CONDITIONAL_REG,
+
+  /* VCC */
+  VCC_CONDITIONAL_REG,
+
+  /* EXECZ */
+  EXECZ_CONDITIONAL_REG,
+
+  /* SCC VCCZ EXECZ */
+  ALL_CONDITIONAL_REGS,
+
+  /* EXEC */
+  EXEC_MASK_REG,
+
+  /* SGPR0-101 */
+  SGPR_REGS,
+
+  /* SGPR0-101 EXEC_LO/EXEC_HI */
+  SGPR_EXEC_REGS,
+
+  /* SGPR0-101, VCC LO/HI, TBA LO/HI, TMA LO/HI, TTMP0-11, M0, EXEC LO/HI,
+     VCCZ, EXECZ, SCC
+     FIXME: Maybe manual has bug and FLAT_SCRATCH is OK.  */
+  SGPR_VOP3A_SRC_REGS,
+
+  /* SGPR0-101, FLAT_SCRATCH_LO/HI, XNACK_MASK_LO/HI, VCC LO/HI, TBA LO/HI
+     TMA LO/HI, TTMP0-11 */
+  SGPR_MEM_SRC_REGS,
+
+  /* SGPR0-101, FLAT_SCRATCH_LO/HI, XNACK_MASK_LO/HI, VCC LO/HI, TBA LO/HI
+     TMA LO/HI, TTMP0-11, M0, EXEC LO/HI */
+  SGPR_DST_REGS,
+
+  /* SGPR0-101, FLAT_SCRATCH_LO/HI, XNACK_MASK_LO/HI, VCC LO/HI, TBA LO/HI
+     TMA LO/HI, TTMP0-11 */
+  SGPR_SRC_REGS,
+  GENERAL_REGS,
+  VGPR_REGS,
+  ALL_GPR_REGS,
+  SRCDST_REGS,
+  AFP_REGS,
+  ALL_REGS,
+  LIM_REG_CLASSES
+};
+
+#define N_REG_CLASSES (int) LIM_REG_CLASSES
+
+#define REG_CLASS_NAMES     \
+{  "NO_REGS",		    \
+   "SCC_CONDITIONAL_REG",   \
+   "VCCZ_CONDITIONAL_REG",  \
+   "VCC_CONDITIONAL_REG",   \
+   "EXECZ_CONDITIONAL_REG", \
+   "ALL_CONDITIONAL_REGS",  \
+   "EXEC_MASK_REG",	    \
+   "SGPR_REGS",		    \
+   "SGPR_EXEC_REGS",	    \
+   "SGPR_VOP3A_SRC_REGS",   \
+   "SGPR_MEM_SRC_REGS",     \
+   "SGPR_DST_REGS",	    \
+   "SGPR_SRC_REGS",	    \
+   "GENERAL_REGS",	    \
+   "VGPR_REGS",		    \
+   "ALL_GPR_REGS",	    \
+   "SRCDST_REGS",	    \
+   "AFP_REGS",		    \
+   "ALL_REGS"		    \
+}
+
+#define NAMED_REG_MASK(N)  (1<<((N)-3*32))
+#define NAMED_REG_MASK2(N) (1<<((N)-4*32))
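+
+/* These select the bit for a named register within the initializer words
+   covering registers 96-127 and 128-159 respectively: for example,
+   NAMED_REG_MASK (VCCZ_REG) == 1 << (108 - 96) == 0x1000 and
+   NAMED_REG_MASK2 (SCC_REG) == 1 << (129 - 128) == 0x2.  */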
+
+#define REG_CLASS_CONTENTS {						   \
+    /* NO_REGS.  */							   \
+    {0, 0, 0, 0,							   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SCC_CONDITIONAL_REG.  */						   \
+    {0, 0, 0, 0,							   \
+     NAMED_REG_MASK2 (SCC_REG), 0, 0, 0,				   \
+     0, 0, 0, 0, 0, 0},						   \
+    /* VCCZ_CONDITIONAL_REG.  */					   \
+    {0, 0, 0, NAMED_REG_MASK (VCCZ_REG),				   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* VCC_CONDITIONAL_REG.  */						   \
+    {0, 0, 0, NAMED_REG_MASK (VCC_LO_REG)|NAMED_REG_MASK (VCC_HI_REG),	   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* EXECZ_CONDITIONAL_REG.  */					   \
+    {0, 0, 0, 0,							   \
+     NAMED_REG_MASK2 (EXECZ_REG), 0, 0, 0,				   \
+     0, 0, 0, 0, 0, 0},						   \
+    /* ALL_CONDITIONAL_REGS.  */					   \
+    {0, 0, 0, NAMED_REG_MASK (VCCZ_REG),				   \
+     NAMED_REG_MASK2 (EXECZ_REG) | NAMED_REG_MASK2 (SCC_REG), 0, 0, 0,	   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* EXEC_MASK_REG.  */						   \
+    {0, 0, 0, NAMED_REG_MASK (EXEC_LO_REG) | NAMED_REG_MASK (EXEC_HI_REG), \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_REGS.  */							   \
+    {0xffffffff, 0xffffffff, 0xffffffff, 0xf1,				   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_EXEC_REGS.	*/						   \
+    {0xffffffff, 0xffffffff, 0xffffffff,				   \
+      0xf1 | NAMED_REG_MASK (EXEC_LO_REG) | NAMED_REG_MASK (EXEC_HI_REG),  \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_VOP3A_SRC_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff,				   \
+      0xffffffff							   \
+       -NAMED_REG_MASK (FLAT_SCRATCH_LO_REG)				   \
+       -NAMED_REG_MASK (FLAT_SCRATCH_HI_REG)				   \
+       -NAMED_REG_MASK (XNACK_MASK_LO_REG)				   \
+       -NAMED_REG_MASK (XNACK_MASK_HI_REG),				   \
+     NAMED_REG_MASK2 (EXECZ_REG) | NAMED_REG_MASK2 (SCC_REG), 0, 0, 0,	   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_MEM_SRC_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff,				   \
+     0xffffffff-NAMED_REG_MASK (VCCZ_REG)-NAMED_REG_MASK (M0_REG)	   \
+     -NAMED_REG_MASK (EXEC_LO_REG)-NAMED_REG_MASK (EXEC_HI_REG),	   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_DST_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff,				   \
+     0xffffffff-NAMED_REG_MASK (VCCZ_REG),				   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* SGPR_SRC_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     NAMED_REG_MASK2 (EXECZ_REG) | NAMED_REG_MASK2 (SCC_REG), 0, 0, 0,	   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* GENERAL_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff, 0xf1,				   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0},							   \
+    /* VGPR_REGS.  */							   \
+    {0, 0, 0, 0,							   \
+     0,		 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0},	   \
+    /* ALL_GPR_REGS.  */						   \
+    {0xffffffff, 0xffffffff, 0xffffffff, 0xf1,				   \
+     0,		 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0},	   \
+    /* SRCDST_REGS.  */							   \
+    {0xffffffff, 0xffffffff, 0xffffffff,				   \
+     0xffffffff-NAMED_REG_MASK (VCCZ_REG),				   \
+     0,		 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0},	   \
+    /* AFP_REGS.  */							   \
+    {0, 0, 0, 0,							   \
+     0, 0, 0, 0,							   \
+     0, 0, 0, 0, 0, 0xf},						   \
+    /* ALL_REGS.  */							   \
+    {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,			   \
+     0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0 }}
+
+#define REGNO_REG_CLASS(REGNO) gcn_regno_reg_class (REGNO)
+#define MODE_CODE_BASE_REG_CLASS(MODE, AS, OUTER, INDEX) \
+	 gcn_mode_code_base_reg_class (MODE, AS, OUTER, INDEX)
+#define REGNO_MODE_CODE_OK_FOR_BASE_P(NUM, MODE, AS, OUTER, INDEX) \
+	 gcn_regno_mode_code_ok_for_base_p (NUM, MODE, AS, OUTER, INDEX)
+#define INDEX_REG_CLASS VGPR_REGS
+#define REGNO_OK_FOR_INDEX_P(regno) regno_ok_for_index_p (regno)
+
+\f
+/* Address spaces.  */
+enum gcn_address_spaces
+{
+  ADDR_SPACE_DEFAULT = 0,
+  ADDR_SPACE_FLAT,
+  ADDR_SPACE_SCALAR_FLAT,
+  ADDR_SPACE_FLAT_SCRATCH,
+  ADDR_SPACE_LDS,
+  ADDR_SPACE_GDS,
+  ADDR_SPACE_SCRATCH,
+  ADDR_SPACE_GLOBAL
+};
+#define REGISTER_TARGET_PRAGMAS() do {                               \
+  c_register_addr_space ("__flat", ADDR_SPACE_FLAT);                 \
+  c_register_addr_space ("__flat_scratch", ADDR_SPACE_FLAT_SCRATCH); \
+  c_register_addr_space ("__scalar_flat", ADDR_SPACE_SCALAR_FLAT);   \
+  c_register_addr_space ("__lds", ADDR_SPACE_LDS);                   \
+  c_register_addr_space ("__gds", ADDR_SPACE_GDS);                   \
+  c_register_addr_space ("__global", ADDR_SPACE_GLOBAL);             \
+} while (0)
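+
+/* With these keywords registered, C code can place objects in a specific
+   address space explicitly, for example (sketch):
+
+     __lds int workgroup_scratch[64];
+     __global float *frame_buffer;  */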
+
+#define STACK_ADDR_SPACE \
+  (TARGET_GCN5_PLUS ? ADDR_SPACE_GLOBAL : ADDR_SPACE_FLAT)
+#define DEFAULT_ADDR_SPACE \
+  ((cfun && cfun->machine && !cfun->machine->use_flat_addressing) \
+   ? ADDR_SPACE_GLOBAL : ADDR_SPACE_FLAT)
+#define AS_SCALAR_FLAT_P(AS)   ((AS) == ADDR_SPACE_SCALAR_FLAT)
+#define AS_FLAT_SCRATCH_P(AS)  ((AS) == ADDR_SPACE_FLAT_SCRATCH)
+#define AS_FLAT_P(AS)	       ((AS) == ADDR_SPACE_FLAT \
+				|| ((AS) == ADDR_SPACE_DEFAULT \
+				    && DEFAULT_ADDR_SPACE == ADDR_SPACE_FLAT))
+#define AS_LDS_P(AS)	       ((AS) == ADDR_SPACE_LDS)
+#define AS_GDS_P(AS)	       ((AS) == ADDR_SPACE_GDS)
+#define AS_SCRATCH_P(AS)       ((AS) == ADDR_SPACE_SCRATCH)
+#define AS_GLOBAL_P(AS)        ((AS) == ADDR_SPACE_GLOBAL \
+				|| ((AS) == ADDR_SPACE_DEFAULT \
+				    && DEFAULT_ADDR_SPACE == ADDR_SPACE_GLOBAL))
+#define AS_ANY_FLAT_P(AS)      (AS_FLAT_SCRATCH_P (AS) || AS_FLAT_P (AS))
+#define AS_ANY_DS_P(AS)	       (AS_LDS_P (AS) || AS_GDS_P (AS))
+
+\f
+/* Instruction Output */
+#define REGISTER_NAMES							    \
+   {"s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10",	    \
+    "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19", "s20",   \
+    "s21", "s22", "s23", "s24", "s25", "s26", "s27", "s28", "s29", "s30",   \
+    "s31", "s32", "s33", "s34", "s35", "s36", "s37", "s38", "s39", "s40",   \
+    "s41", "s42", "s43", "s44", "s45", "s46", "s47", "s48", "s49", "s50",   \
+    "s51", "s52", "s53", "s54", "s55", "s56", "s57", "s58", "s59", "s60",   \
+    "s61", "s62", "s63", "s64", "s65", "s66", "s67", "s68", "s69", "s70",   \
+    "s71", "s72", "s73", "s74", "s75", "s76", "s77", "s78", "s79", "s80",   \
+    "s81", "s82", "s83", "s84", "s85", "s86", "s87", "s88", "s89", "s90",   \
+    "s91", "s92", "s93", "s94", "s95", "s96", "s97", "s98", "s99",	    \
+    "s100", "s101",							    \
+    "flat_scratch_lo", "flat_scratch_hi", "xnack_mask_lo", "xnack_mask_hi", \
+    "vcc_lo", "vcc_hi", "vccz", "tba_lo", "tba_hi", "tma_lo", "tma_hi",     \
+    "ttmp0", "ttmp1", "ttmp2", "ttmp3", "ttmp4", "ttmp5", "ttmp6", "ttmp7", \
+    "ttmp8", "ttmp9", "ttmp10", "ttmp11", "m0", "exec_lo", "exec_hi",	    \
+    "execz", "scc",							    \
+    "res130", "res131", "res132", "res133", "res134", "res135", "res136",   \
+    "res137", "res138", "res139", "res140", "res141", "res142", "res143",   \
+    "res144", "res145", "res146", "res147", "res148", "res149", "res150",   \
+    "res151", "res152", "res153", "res154", "res155", "res156", "res157",   \
+    "res158", "res159",							    \
+    "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",	    \
+    "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",   \
+    "v21", "v22", "v23", "v24", "v25", "v26", "v27", "v28", "v29", "v30",   \
+    "v31", "v32", "v33", "v34", "v35", "v36", "v37", "v38", "v39", "v40",   \
+    "v41", "v42", "v43", "v44", "v45", "v46", "v47", "v48", "v49", "v50",   \
+    "v51", "v52", "v53", "v54", "v55", "v56", "v57", "v58", "v59", "v60",   \
+    "v61", "v62", "v63", "v64", "v65", "v66", "v67", "v68", "v69", "v70",   \
+    "v71", "v72", "v73", "v74", "v75", "v76", "v77", "v78", "v79", "v80",   \
+    "v81", "v82", "v83", "v84", "v85", "v86", "v87", "v88", "v89", "v90",   \
+    "v91", "v92", "v93", "v94", "v95", "v96", "v97", "v98", "v99", "v100",  \
+    "v101", "v102", "v103", "v104", "v105", "v106", "v107", "v108", "v109", \
+    "v110", "v111", "v112", "v113", "v114", "v115", "v116", "v117", "v118", \
+    "v119", "v120", "v121", "v122", "v123", "v124", "v125", "v126", "v127", \
+    "v128", "v129", "v130", "v131", "v132", "v133", "v134", "v135", "v136", \
+    "v137", "v138", "v139", "v140", "v141", "v142", "v143", "v144", "v145", \
+    "v146", "v147", "v148", "v149", "v150", "v151", "v152", "v153", "v154", \
+    "v155", "v156", "v157", "v158", "v159", "v160", "v161", "v162", "v163", \
+    "v164", "v165", "v166", "v167", "v168", "v169", "v170", "v171", "v172", \
+    "v173", "v174", "v175", "v176", "v177", "v178", "v179", "v180", "v181", \
+    "v182", "v183", "v184", "v185", "v186", "v187", "v188", "v189", "v190", \
+    "v191", "v192", "v193", "v194", "v195", "v196", "v197", "v198", "v199", \
+    "v200", "v201", "v202", "v203", "v204", "v205", "v206", "v207", "v208", \
+    "v209", "v210", "v211", "v212", "v213", "v214", "v215", "v216", "v217", \
+    "v218", "v219", "v220", "v221", "v222", "v223", "v224", "v225", "v226", \
+    "v227", "v228", "v229", "v230", "v231", "v232", "v233", "v234", "v235", \
+    "v236", "v237", "v238", "v239", "v240", "v241", "v242", "v243", "v244", \
+    "v245", "v246", "v247", "v248", "v249", "v250", "v251", "v252", "v253", \
+    "v254", "v255",							    \
+    "?ap0", "?ap1", "?fp0", "?fp1" }
+
+#define PRINT_OPERAND(FILE, X, CODE)  print_operand(FILE, X, CODE)
+#define PRINT_OPERAND_ADDRESS(FILE, ADDR)  print_operand_address (FILE, ADDR)
+#define PRINT_OPERAND_PUNCT_VALID_P(CODE) (CODE == '^')
+
+\f
+/* Register Arguments */
+
+#ifndef USED_FOR_TARGET
+
+#define GCN_KERNEL_ARG_TYPES 19
+struct GTY(()) gcn_kernel_args
+{
+  long requested;
+  int reg[GCN_KERNEL_ARG_TYPES];
+  int order[GCN_KERNEL_ARG_TYPES];
+  int nargs, nsgprs;
+};
+
+typedef struct gcn_args
+{
+  /* True if this isn't a kernel (HSA runtime entrypoint).  */
+  bool normal_function;
+  tree fntype;
+  struct gcn_kernel_args args;
+  int num;
+  int offset;
+  int alignment;
+} CUMULATIVE_ARGS;
+#endif
+
+#define INIT_CUMULATIVE_ARGS(CUM,FNTYPE,LIBNAME,FNDECL,N_NAMED_ARGS) \
+  gcn_init_cumulative_args (&(CUM), (FNTYPE), (LIBNAME), (FNDECL),   \
+			    (N_NAMED_ARGS) != -1)
+
+\f
+#ifndef USED_FOR_TARGET
+
+#include "hash-table.h"
+#include "hash-map.h"
+#include "vec.h"
+
+struct GTY(()) machine_function
+{
+  struct gcn_kernel_args args;
+  int kernarg_segment_alignment;
+  int kernarg_segment_byte_size;
+  /* Frame layout info for normal functions.  */
+  bool normal_function;
+  bool need_frame_pointer;
+  bool lr_needs_saving;
+  HOST_WIDE_INT outgoing_args_size;
+  HOST_WIDE_INT pretend_size;
+  HOST_WIDE_INT local_vars;
+  HOST_WIDE_INT callee_saves;
+
+  unsigned lds_allocated;
+  hash_map<tree, int> *lds_allocs;
+
+  vec<tree, va_gc> *reduc_decls;
+
+  bool use_flat_addressing;
+};
+#endif
+
+\f
+/* Codes for all the GCN builtins.  */
+
+enum gcn_builtin_codes
+{
+#define DEF_BUILTIN(fcode, icode, name, type, params, expander) \
+  GCN_BUILTIN_ ## fcode,
+#define DEF_BUILTIN_BINOP_INT_FP(fcode, ic, name)	\
+  GCN_BUILTIN_ ## fcode ## _V64SI,			\
+  GCN_BUILTIN_ ## fcode ## _V64SI_unspec,
+#include "gcn-builtins.def"
+#undef DEF_BUILTIN
+#undef DEF_BUILTIN_BINOP_INT_FP
+  GCN_BUILTIN_MAX
+};
+
+\f
+/* Misc */
+
+/* We can load/store 128-bit quantities, but having this larger than
+   MAX_FIXED_MODE_SIZE (which we want to be 64 bits) causes problems.  */
+#define MOVE_MAX 8
+
+#define AVOID_CCMODE_COPIES 1
+#define SLOW_BYTE_ACCESS 0
+#define WORD_REGISTER_OPERATIONS 1
+
+/* Definitions for register eliminations.
+
+   This is an array of structures.  Each structure initializes one pair
+   of eliminable registers.  The "from" register number is given first,
+   followed by "to".  Eliminations of the same "from" register are listed
+   in order of preference.  */
+
+#define ELIMINABLE_REGS					\
+{{ ARG_POINTER_REGNUM, STACK_POINTER_REGNUM },		\
+ { ARG_POINTER_REGNUM, HARD_FRAME_POINTER_REGNUM },	\
+ { FRAME_POINTER_REGNUM, STACK_POINTER_REGNUM },	\
+ { FRAME_POINTER_REGNUM, HARD_FRAME_POINTER_REGNUM }}
+
+/* Define the offset between two registers, one to be eliminated, and the
+   other its replacement, at the start of a routine.  */
+
+#define INITIAL_ELIMINATION_OFFSET(FROM, TO, OFFSET)	\
+  ((OFFSET) = gcn_initial_elimination_offset ((FROM), (TO)))
+
+
+/* Define this macro if it is advisable to hold scalars in registers
+   in a wider mode than that declared by the program.  In such cases,
+   the value is constrained to be within the bounds of the declared
+   type, but kept valid in the wider mode.  The signedness of the
+   extension may differ from that of the type.  */
+
+#define PROMOTE_MODE(MODE,UNSIGNEDP,TYPE)			\
+  if (GET_MODE_CLASS (MODE) == MODE_INT				\
+      && (TYPE == NULL || TREE_CODE (TYPE) != VECTOR_TYPE)	\
+      && GET_MODE_SIZE (MODE) < UNITS_PER_WORD)			\
+    {								\
+      (MODE) = SImode;						\
+    }
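+
+/* For example, a char or short scalar held in a register is widened to and
+   kept in SImode, while vector element types and word-sized quantities are
+   left in their declared modes.  */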
+
+/* This needs to match gcn_function_value.  */
+#define LIBCALL_VALUE(MODE) gen_rtx_REG (MODE, SGPR_REGNO (RETURN_VALUE_REG))
+
+\f
+/* Costs.  */
+
+/* Branches are to be discouraged when there's an alternative.
+   FIXME: This number is plucked from the air.  */
+#define BRANCH_COST(SPEED_P, PREDICABLE_P) 10
+
+\f
+/* Profiling */
+#define FUNCTION_PROFILER(FILE, LABELNO)
+#define NO_PROFILE_COUNTERS 1
+#define PROFILE_BEFORE_PROLOGUE 0
+
+/* Trampolines */
+#define TRAMPOLINE_SIZE 36
+#define TRAMPOLINE_ALIGNMENT 64
+
+/* Disable the "current_vector_size" feature intended for
+   AVX<->SSE switching.  */
+#define TARGET_DISABLE_CURRENT_VECTOR_SIZE
diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
new file mode 100644
index 0000000..cf96e24
--- /dev/null
+++ b/gcc/config/gcn/gcn.md
@@ -0,0 +1,2199 @@
+;; Copyright (C) 2016-2018 Free Software Foundation, Inc.
+
+;; This file is free software; you can redistribute it and/or modify it under
+;; the terms of the GNU General Public License as published by the Free
+;; Software Foundation; either version 3 of the License, or (at your option)
+;; any later version.
+
+;; This file is distributed in the hope that it will be useful, but WITHOUT
+;; ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+;; FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+;; for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+;;- See file "rtl.def" for documentation on define_insn, match_*, et. al.
+
+(include "predicates.md")
+(include "constraints.md")
+
+;; {{{ Constants and enums
+
+; Named registers
+(define_constants
+  [(FIRST_SGPR_REG		 0)
+   (LAST_SGPR_REG		 101)
+   (FLAT_SCRATCH_REG		 102)
+   (FLAT_SCRATCH_LO_REG		 102)
+   (FLAT_SCRATCH_HI_REG		 103)
+   (XNACK_MASK_REG		 104)
+   (XNACK_MASK_LO_REG		 104)
+   (XNACK_MASK_HI_REG		 105)
+   (VCC_REG			 106)
+   (VCC_LO_REG			 106)
+   (VCC_HI_REG			 107)
+   (VCCZ_REG			 108)
+   (TBA_REG			 109)
+   (TBA_LO_REG			 109)
+   (TBA_HI_REG			 110)
+   (TMA_REG			 111)
+   (TMA_LO_REG			 111)
+   (TMA_HI_REG			 112)
+   (TTMP0_REG			 113)
+   (TTMP11_REG			 124)
+   (M0_REG			 125)
+   (EXEC_REG			 126)
+   (EXEC_LO_REG			 126)
+   (EXEC_HI_REG			 127)
+   (EXECZ_REG			 128)
+   (SCC_REG			 129)
+   (FIRST_VGPR_REG		 160)
+   (LAST_VGPR_REG		 415)])
+
+(define_constants
+  [(SP_REGNUM 16)
+   (LR_REGNUM 18)
+   (AP_REGNUM 416)
+   (FP_REGNUM 418)])
+
+(define_c_enum "unspecv" [
+  UNSPECV_PROLOGUE_USE
+  UNSPECV_KERNEL_RETURN
+  UNSPECV_BARRIER
+  UNSPECV_ATOMIC
+  UNSPECV_ICACHE_INV])
+
+(define_c_enum "unspec" [
+  UNSPEC_VECTOR
+  UNSPEC_BPERMUTE
+  UNSPEC_SGPRBASE
+  UNSPEC_MEMORY_BARRIER
+  UNSPEC_SMIN_DPP_SHR UNSPEC_SMAX_DPP_SHR
+  UNSPEC_UMIN_DPP_SHR UNSPEC_UMAX_DPP_SHR
+  UNSPEC_PLUS_DPP_SHR
+  UNSPEC_PLUS_CARRY_DPP_SHR UNSPEC_PLUS_CARRY_IN_DPP_SHR
+  UNSPEC_AND_DPP_SHR UNSPEC_IOR_DPP_SHR UNSPEC_XOR_DPP_SHR
+  UNSPEC_MOV_FROM_LANE63
+  UNSPEC_GATHER
+  UNSPEC_SCATTER])
+
+;; }}}
+;; {{{ Attributes
+
+; Instruction type (encoding) as described in the ISA specification.
+; The following table summarizes possible operands of individual instruction
+; types and corresponding constraints.
+;
+; sop2 - scalar, two inputs, one output
+;	 ssrc0/ssrc1: sgpr 0-102; flat_scratch,xnack,vcc,tba,tma,ttmp0-11,exec
+;		      vccz,execz,scc,inline immediate,fp inline immediate
+;	 sdst: sgpr 0-102; flat_scratch,xnack,vcc,tba,tma,ttmp0-11,exec
+;
+;	 Constraints "=SD, SD", "SSA,SSB", "SSB,SSA"
+;
+; sopk - scalar, inline constant input, one output
+;	 simm16: 16bit inline constant
+;	 sdst: same as sop2/ssrc0
+;
+;	 Constraints "=SD", "J"
+;
+; sop1 - scalar, one input, one output
+;	 ssrc0: same as sop2/ssrc0.  FIXME: the manual omits VCCZ
+;	 sdst: same as sop2/sdst
+;
+;	 Constraints "=SD", "SSA"
+;
+; sopc - scalar, two inputs, one comparison
+;	 ssrc0: same as sop2/ssrc0.
+;
+;	 Constraints "SSI,SSA", "SSA,SSI"
+;
+; sopp - scalar, one constant input, one special
+;	 simm16
+;
+; smem - scalar memory
+;	 sbase: aligned pair of sgprs.  Specify {size[15:0], base[47:0]} in
+;               dwords
+;	 sdata: sgpr0-102, flat_scratch, xnack, vcc, tba, tma
+;	 offset: sgpr or 20bit unsigned byte offset
+;
+; vop2 - vector, two inputs, one output
+;	 vsrc0: sgpr0-102,flat_scratch,xnack,vcc,tba,ttmp0-11,m0,exec,
+;		inline constant -16 to -64, fp inline immediate, vccz, execz,
+;		scc, lds, literal constant, vgpr0-255
+;	 vsrc1: vgpr0-255
+;	 vdst: vgpr0-255
+;	 Limitations: At most one SGPR and at most one constant;
+;		      if a constant is used, the SGPR must be M0;
+;		      only SRC0 can be LDS_DIRECT
+;
+;	 constraints: "=v", "vBSS", "v"
+;
+; vop1 - vector, one input, one output
+;	 vsrc0: same as vop2/src0
+;	 vdst: vgpr0-255
+;
+;	 constraints: "=v", "vBSS"
+;
+; vopc - vector, two inputs, one comparison output
+;	 vsrc0: same as vop2/src0
+;	 vsrc1: vgpr0-255
+;	 vdst:
+;
+;	 constraints: "vASS", "v"
+;
+; vop3a - vector, three inputs, one output
+;	 vdst: vgpr0-255, for v_cmp sgpr or vcc
+;	 abs,clamp
+;	 vsrc0: sgpr0-102,vcc,tba,ttmp0-11,m0,exec,
+;		inline constant -16 to -64, fp inline immediate, vccz, execz,
+;		scc, lds_direct
+;		FIXME: really missing 1/pi?  Really 104 SGPRs?
+;
+; vop3b - vector, three inputs, one vector output, one scalar output
+;	 vsrc0,vsrc1,vsrc2: same as vop3a vsrc0
+;	 vdst: vgpr0-255
+;	 sdst: sgpr0-103/vcc/tba/tma/ttmp0-11
+;
+; vop_sdwa - second dword for vop1/vop2/vopc for specifying sub-dword address
+;	 src0: vgpr0-255
+;	 dst_sel: BYTE_0-3, WORD_0-1, DWORD
+;	 dst_unused: UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE
+;	 clamp: true/false
+;	 src0_sel: BYTE_0-3, WORD_0-1, DWORD
+;	 flags: src0_sext, src0_neg, src0_abs, src1_sel, src1_sext, src1_neg,
+;		src1_abs
+;
+; vop_dpp - second dword for vop1/vop2/vopc for specifying data-parallel ops
+;	 src0: vgpr0-255
+;	 dpp_ctrl: quad_perm, row_sl0-15, row_sr0-15, row_rr0-15, wf_sl1,
+;		  wf_rl1, wf_sr1, wf_rr1, row_mirror, row_half_mirror,
+;		  bcast15, bcast31
+;	 flags: src0_neg, src0_abs, src1_neg, src1_abs
+;	 bank_mask: 4-bit mask
+;	 row_mask: 4-bit mask
+;
+; ds - Local and global data share instructions.
+;	 offset0: 8-bit constant
+;	 offset1: 8-bit constant
+;	 flag: gds
+;	 addr: vgpr0-255
+;	 data0: vgpr0-255
+;	 data1: vgpr0-255
+;	 vdst: vgpr0-255
+;
+; mubuf - Untyped memory buffer operation. First word with LDS, second word
+;	  non-LDS.
+;	 offset: 12-bit constant
+;	 vaddr: vgpr0-255
+;	 vdata: vgpr0-255
+;	 srsrc: sgpr0-102
+;	 soffset: sgpr0-102
+;	 flags: offen, idxen, glc, lds, slc, tfe
+;
+; mtbuf - Typed memory buffer operation. Two words
+;	 offset: 12-bit constant
+;	 dfmt: 4-bit constant
+;	 nfmt: 3-bit constant
+;	 vaddr: vgpr0-255
+;	 vdata: vgpr0-255
+;	 srsrc: sgpr0-102
+;	 soffset: sgpr0-102
+;	 flags: offen, idxen, glc, lds, slc, tfe
+;
+; flat - flat or global memory operations
+;	 flags: glc, slc
+;	 addr: vgpr0-255
+;	 data: vgpr0-255
+;	 vdst: vgpr0-255
+;
+; mult - expands to multiple instructions (pseudo encoding)
+;
+; vmult - as mult, when a vector instruction is used.
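+;
+; For example (an illustrative note): a 32-bit scalar AND below is a
+; "sop2" instruction, taking SGPR-class constraints such as "SgA"/"SgB"
+; and carrying (set_attr "type" "sop2").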
+
+(define_attr "type"
+	     "unknown,sop1,sop2,sopk,sopc,sopp,smem,ds,vop2,vop1,vopc,
+	      vop3a,vop3b,vop_sdwa,vop_dpp,mubuf,mtbuf,flat,mult,vmult"
+	     (const_string "unknown"))
+
+; Set if instruction is executed in scalar or vector unit
+
+(define_attr "unit" "unknown,scalar,vector"
+  (cond [(eq_attr "type" "sop1,sop2,sopk,sopc,sopp,smem,mult")
+	    (const_string "scalar")
+	 (eq_attr "type" "vop2,vop1,vopc,vop3a,vop3b,ds,
+			  vop_sdwa,vop_dpp,flat,vmult")
+	    (const_string "vector")]
+	 (const_string "unknown")))
+
+; All vector instructions run as 64 threads as predicated by the EXEC
+; register.  Scalar operations in vector registers require a single lane
+; enabled, vector moves require a full set of lanes enabled, and most vector
+; operations handle the lane masking themselves.
+; The md_reorg pass is responsible for ensuring that EXEC is set appropriately
+; according to the following settings:
+;   auto   - instruction doesn't use EXEC, or handles it itself.
+;            md_reorg will inspect def/use to determine what to do.
+;   single - disable all but lane zero.
+;   full   - enable all lanes.
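+;
+; For example, the v_mov_b32 alternatives that materialize a scalar value
+; in a VGPR in the move patterns below are marked "single", since only
+; lane zero's result is meaningful there.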
+
+(define_attr "exec" "auto,single,full"
+   (const_string "auto"))
+
+; Infer the (worst-case) length from the instruction type by default.  Many
+; types can have an optional immediate word following, which we include here.
+; "Multiple" types are counted as two 64-bit instructions.  This is just a
+; default fallback: it can be overridden per-alternative in insn patterns for
+; greater accuracy.
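+;
+; For example, a sop1 instruction is 4 bytes of encoding plus an optional
+; 4-byte literal, giving the 8-byte worst case below, and "mult"/"vmult"
+; count as two such maximal instructions.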
+
+(define_attr "length" ""
+  (cond [(eq_attr "type" "sop1") (const_int 8)
+	 (eq_attr "type" "sop2") (const_int 8)
+	 (eq_attr "type" "sopk") (const_int 8)
+	 (eq_attr "type" "sopc") (const_int 8)
+	 (eq_attr "type" "sopp") (const_int 4)
+	 (eq_attr "type" "smem") (const_int 8)
+	 (eq_attr "type" "ds")   (const_int 8)
+	 (eq_attr "type" "vop1") (const_int 8)
+	 (eq_attr "type" "vop2") (const_int 8)
+	 (eq_attr "type" "vopc") (const_int 8)
+	 (eq_attr "type" "vop3a") (const_int 8)
+	 (eq_attr "type" "vop3b") (const_int 8)
+	 (eq_attr "type" "vop_sdwa") (const_int 8)
+	 (eq_attr "type" "vop_dpp") (const_int 8)
+	 (eq_attr "type" "flat") (const_int 8)
+	 (eq_attr "type" "mult") (const_int 16)
+	 (eq_attr "type" "vmult") (const_int 16)]
+	(const_int 4)))
+
+; Disable alternatives that only apply to specific ISA variants.
+
+(define_attr "gcn_version" "gcn3,gcn5" (const_string "gcn3"))
+
+(define_attr "enabled" ""
+  (cond [(eq_attr "gcn_version" "gcn3") (const_int 1)
+	 (and (eq_attr "gcn_version" "gcn5")
+	      (ne (symbol_ref "TARGET_GCN5_PLUS") (const_int 0)))
+	   (const_int 1)]
+	(const_int 0)))
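+
+; For example, the global atomic alternatives in the atomics section below
+; are restricted to GCN5 via (set_attr "gcn_version" "gcn5").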
+
+; We need to be able to identify v_readlane and v_writelane with
+; SGPR lane selection in order to handle "Manually Inserted Wait States".
+
+(define_attr "laneselect" "yes,no" (const_string "no"))
+
+;; }}}
+;; {{{ Iterators useful across the whole machine description
+
+(define_mode_iterator SIDI [SI DI])
+(define_mode_iterator SFDF [SF DF])
+(define_mode_iterator SISF [SI SF])
+(define_mode_iterator QIHI [QI HI])
+(define_mode_iterator DIDF [DI DF])
+
+;; }}}
+;; {{{ Code attributes
+
+; Translate RTX code into GCN instruction mnemonics with and without
+; suffixes such as _b32, etc.
+
+(define_code_attr mnemonic
+  [(minus "sub%i")
+   (plus "add%i")
+   (ashift "lshl%b")
+   (lshiftrt "lshr%b")
+   (ashiftrt "ashr%i")
+   (and "and%B")
+   (ior "or%B")
+   (xor "xor%B")
+   (mult "mul%i")
+   (smin "min%i")
+   (smax "max%i")
+   (umin "min%u")
+   (umax "max%u")
+   (not "not%b")
+   (popcount "bcnt_u32%b")])
+
+(define_code_attr bare_mnemonic
+  [(plus "add")
+   (minus "sub")
+   (and "and")
+   (ior "or")
+   (xor "xor")])
+
+(define_code_attr s_mnemonic
+  [(not "not%b")
+   (popcount "bcnt1_i32%b")])
+
+(define_code_attr revmnemonic
+  [(minus "subrev%i")
+   (ashift "lshlrev%b")
+   (lshiftrt "lshrrev%b")
+   (ashiftrt "ashrrev%i")])
+
+; Translate RTX code into corresponding expander name.
+
+(define_code_attr expander
+  [(and "and")
+   (ior "ior")
+   (xor "xor")
+   (plus "add")
+   (minus "sub")
+   (ashift "ashl")
+   (lshiftrt "lshr")
+   (ashiftrt "ashr")
+   (mult "mul")
+   (smin "smin")
+   (smax "smax")
+   (umin "umin")
+   (umax "umax")
+   (not "one_cmpl")
+   (popcount "popcount")])
+
+;; }}}
+;; {{{ Miscellaneous instructions
+
+(define_insn "nop"
+  [(const_int 0)]
+  ""
+  "s_nop\t0x0"
+  [(set_attr "type" "sopp")])
+
+; FIXME: What should the value of the immediate be? Zero is disallowed, so
+; pick 1 for now.
+(define_insn "trap"
+  [(trap_if (const_int 1) (const_int 0))]
+  ""
+  "s_trap\t1"
+  [(set_attr "type" "sopp")])
+
+;; }}}
+;; {{{ Moves
+
+;; All scalar modes we support moves in.
+(define_mode_iterator MOV_MODE [BI QI HI SI DI TI SF DF])
+
+; This is the entry point for creating all kinds of scalar moves,
+; including reloads and symbols.
+
+(define_expand "mov<mode>"
+  [(set (match_operand:MOV_MODE 0 "nonimmediate_operand")
+	(match_operand:MOV_MODE 1 "general_operand"))]
+  ""
+  {
+    if (MEM_P (operands[0]))
+      operands[1] = force_reg (<MODE>mode, operands[1]);
+
+    if (!lra_in_progress && !reload_completed
+	&& !gcn_valid_move_p (<MODE>mode, operands[0], operands[1]))
+      {
+	/* Something is probably trying to generate a move
+	   which can only work indirectly,
+	   e.g. a move from LDS memory to an SGPR hardreg,
+	   or MEM:QI to an SGPR.  */
+	rtx tmpreg = gen_reg_rtx (<MODE>mode);
+	emit_insn (gen_mov<mode> (tmpreg, operands[1]));
+	emit_insn (gen_mov<mode> (operands[0], tmpreg));
+	DONE;
+      }
+
+    if (<MODE>mode == DImode
+	&& (GET_CODE (operands[1]) == SYMBOL_REF
+	    || GET_CODE (operands[1]) == LABEL_REF))
+      {
+	emit_insn (gen_movdi_symbol (operands[0], operands[1]));
+	DONE;
+      }
+  })
+
+; Split invalid moves into two valid moves
+
+(define_split
+  [(set (match_operand:MOV_MODE 0 "nonimmediate_operand")
+	(match_operand:MOV_MODE 1 "general_operand"))]
+  "!reload_completed && !lra_in_progress
+   && !gcn_valid_move_p (<MODE>mode, operands[0], operands[1])"
+  [(set (match_dup 2) (match_dup 1))
+   (set (match_dup 0) (match_dup 2))]
+  {
+    operands[2] = gen_reg_rtx (<MODE>mode);
+  })
+
+; We need BImode move so we can reload flags registers.
+
+(define_insn "*movbi"
+  [(set (match_operand:BI 0 "nonimmediate_operand"
+				    "=SD,   v,Sg,cs,cV,cV,Sm,RS, v,RF, v,RM")
+	(match_operand:BI 1 "gcn_load_operand"
+				    "SSA,vSSA, v,SS, v,SS,RS,Sm,RF, v,RM, v"))]
+  ""
+  {
+    /* SCC as an operand is currently not accepted by the LLVM assembler, so
+       we emit bytes directly as a workaround.  */
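+    /* A note on the encodings (read off the byte sequences themselves):
+       0xfd in the source-operand field selects SCC, so alternative 0
+       below hand-assembles the unsupported "s_mov_b32 %0, scc".  */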
+    switch (which_alternative) {
+    case 0:
+      if (REG_P (operands[1]) && REGNO (operands[1]) == SCC_REG)
+	return "; s_mov_b32\t%0,%1 is not supported by the assembler.\;"
+	       ".byte\t0xfd\;"
+	       ".byte\t0x0\;"
+	       ".byte\t0x80|%R0\;"
+	       ".byte\t0xbe";
+      else
+	return "s_mov_b32\t%0, %1";
+    case 1:
+      if (REG_P (operands[1]) && REGNO (operands[1]) == SCC_REG)
+	return "; v_mov_b32\t%0, %1\;"
+	       ".byte\t0xfd\;"
+	       ".byte\t0x2\;"
+	       ".byte\t((%V0<<1)&0xff)\;"
+	       ".byte\t0x7e|(%V0>>7)";
+      else
+	return "v_mov_b32\t%0, %1";
+    case 2:
+      return "v_readlane_b32\t%0, %1, 0";
+    case 3:
+      return "s_cmpk_lg_u32\t%1, 0";
+    case 4:
+      return "v_cmp_ne_u32\tvcc, 0, %1";
+    case 5:
+      if (REGNO (operands[1]) == SCC_REG)
+	return "; s_mov_b32\t%0, %1 is not supported by the assembler.\;"
+	       ".byte\t0xfd\;"
+	       ".byte\t0x0\;"
+	       ".byte\t0xea\;"
+	       ".byte\t0xbe\;"
+	       "s_mov_b32\tvcc_hi, 0";
+      else
+	return "s_mov_b32\tvcc_lo, %1\;"
+	       "s_mov_b32\tvcc_hi, 0";
+    case 6:
+      return "s_load_dword\t%0, %A1\;s_waitcnt\tlgkmcnt(0)";
+    case 7:
+      return "s_store_dword\t%1, %A0\;s_waitcnt\tlgkmcnt(0)";
+    case 8:
+      return "flat_load_dword\t%0, %A1%O1%g1\;s_waitcnt\t0";
+    case 9:
+      return "flat_store_dword\t%A0, %1%O0%g0\;s_waitcnt\t0";
+    case 10:
+      return "global_load_dword\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)";
+    case 11:
+      return "global_store_dword\t%A0, %1%O0%g0\;s_waitcnt\tvmcnt(0)";
+    default:
+      gcc_unreachable ();
+    }
+  }
+  [(set_attr "type" "sop1,vop1,vop3a,sopk,vopc,mult,smem,smem,flat,flat,
+		     flat,flat")
+   (set_attr "exec" "*,single,*,*,single,*,*,*,single,single,single,single")
+   (set_attr "length" "4,4,4,4,4,8,12,12,12,12,12,12")])
+
+; 32bit move pattern
+
+(define_insn "*mov<mode>_insn"
+  [(set (match_operand:SISF 0 "nonimmediate_operand"
+		  "=SD,SD,SD,SD,RB,Sm,RS,v,Sg, v, v,RF,v,RLRG,   v,SD, v,RM")
+	(match_operand:SISF 1 "gcn_load_operand"
+		  "SSA, J, B,RB,Sm,RS,Sm,v, v,SS,RF, v,B,   v,RLRG, Y,RM, v"))]
+  ""
+  "@
+  s_mov_b32\t%0, %1
+  s_movk_i32\t%0, %1
+  s_mov_b32\t%0, %1
+  s_buffer_load%s0\t%0, s[0:3], %1\;s_waitcnt\tlgkmcnt(0)
+  s_buffer_store%s1\t%1, s[0:3], %0\;s_waitcnt\tlgkmcnt(0)
+  s_load_dword\t%0, %A1\;s_waitcnt\tlgkmcnt(0)
+  s_store_dword\t%1, %A0\;s_waitcnt\tlgkmcnt(0)
+  v_mov_b32\t%0, %1
+  v_readlane_b32\t%0, %1, 0
+  v_writelane_b32\t%0, %1, 0
+  flat_load_dword\t%0, %A1%O1%g1\;s_waitcnt\t0
+  flat_store_dword\t%A0, %1%O0%g0\;s_waitcnt\t0
+  v_mov_b32\t%0, %1
+  ds_write_b32\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  ds_read_b32\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  s_mov_b32\t%0, %1
+  global_load_dword\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
+  global_store_dword\t%A0, %1%O0%g0\;s_waitcnt\tvmcnt(0)"
+  [(set_attr "type" "sop1,sopk,sop1,smem,smem,smem,smem,vop1,vop3a,vop3a,flat,
+		     flat,vop1,ds,ds,sop1,flat,flat")
+   (set_attr "exec" "*,*,*,*,*,*,*,single,*,*,single,single,single,
+	             single,single,*,single,single")
+   (set_attr "length" "4,4,8,12,12,12,12,4,8,8,12,12,8,12,12,8,12,12")])
+
+; 8/16bit move pattern
+
+(define_insn "*mov<mode>_insn"
+  [(set (match_operand:QIHI 0 "nonimmediate_operand"
+				 "=SD,SD,SD,v,Sg, v, v,RF,v,RLRG,   v, v,RM")
+	(match_operand:QIHI 1 "gcn_load_operand"
+				 "SSA, J, B,v, v,SS,RF, v,B,   v,RLRG,RM, v"))]
+  "gcn_valid_move_p (<MODE>mode, operands[0], operands[1])"
+  "@
+  s_mov_b32\t%0, %1
+  s_movk_i32\t%0, %1
+  s_mov_b32\t%0, %1
+  v_mov_b32\t%0, %1
+  v_readlane_b32\t%0, %1, 0
+  v_writelane_b32\t%0, %1, 0
+  flat_load%o1\t%0, %A1%O1%g1\;s_waitcnt\t0
+  flat_store%s0\t%A0, %1%O0%g0\;s_waitcnt\t0
+  v_mov_b32\t%0, %1
+  ds_write%b0\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  ds_read%u1\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  global_load%o1\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
+  global_store%s0\t%A0, %1%O0%g0\;s_waitcnt\tvmcnt(0)"
+  [(set_attr "type"
+	     "sop1,sopk,sop1,vop1,vop3a,vop3a,flat,flat,vop1,ds,ds,flat,flat")
+   (set_attr "exec" "*,*,*,single,*,*,single,single,single,single,
+                     single,single,single")
+   (set_attr "length" "4,4,8,4,4,4,12,12,8,12,12,12,12")])
+
+; 64bit move pattern
+
+(define_insn_and_split "*mov<mode>_insn"
+  [(set (match_operand:DIDF 0 "nonimmediate_operand"
+			  "=SD,SD,SD,RS,Sm,v, v,Sg, v, v,RF,RLRG,   v, v,RM")
+	(match_operand:DIDF 1 "general_operand"
+			  "SSA, C,DB,Sm,RS,v,DB, v,SS,RF, v,   v,RLRG,RM, v"))]
+  "GET_CODE(operands[1]) != SYMBOL_REF"
+  "@
+  s_mov_b64\t%0, %1
+  s_mov_b64\t%0, %1
+  #
+  s_store_dwordx2\t%1, %A0\;s_waitcnt\tlgkmcnt(0)
+  s_load_dwordx2\t%0, %A1\;s_waitcnt\tlgkmcnt(0)
+  #
+  #
+  #
+  #
+  flat_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\t0
+  flat_store_dwordx2\t%A0, %1%O0%g0\;s_waitcnt\t0
+  ds_write_b64\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  ds_read_b64\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  global_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
+  global_store_dwordx2\t%A0, %1%O0%g0\;s_waitcnt\tvmcnt(0)"
+  "(reload_completed && !MEM_P (operands[0]) && !MEM_P (operands[1])
+    && !gcn_sgpr_move_p (operands[0], operands[1]))
+   || (GET_CODE (operands[1]) == CONST_INT && !gcn_constant64_p (operands[1]))"
+  [(set (match_dup 0) (match_dup 1))
+   (set (match_dup 2) (match_dup 3))]
+  {
+    rtx inlo = gen_lowpart (SImode, operands[1]);
+    rtx inhi = gen_highpart_mode (SImode, <MODE>mode, operands[1]);
+    rtx outlo = gen_lowpart (SImode, operands[0]);
+    rtx outhi = gen_highpart_mode (SImode, <MODE>mode, operands[0]);
+
+    /* Ensure that overlapping registers aren't corrupted.  */
+    if (REGNO (outlo) == REGNO (inhi))
+      {
+	operands[0] = outhi;
+	operands[1] = inhi;
+	operands[2] = outlo;
+	operands[3] = inlo;
+      }
+    else
+      {
+	operands[0] = outlo;
+	operands[1] = inlo;
+	operands[2] = outhi;
+	operands[3] = inhi;
+      }
+  }
+  [(set_attr "type" "sop1,sop1,mult,smem,smem,vmult,vmult,vmult,vmult,flat,
+		     flat,ds,ds,flat,flat")
+   (set_attr "exec" "*,*,*,*,*,*,*,*,*,single,single,single,single,single,
+		     single")
+   (set_attr "length" "4,8,*,12,12,*,*,*,*,12,12,12,12,12,12")])
+
+; 128-bit move.
+
+(define_insn_and_split "*movti_insn"
+  [(set (match_operand:TI 0 "nonimmediate_operand"
+				      "=SD,RS,Sm,RF, v,v, v,SD,RM, v,RL, v")
+	(match_operand:TI 1 "general_operand"
+				      "SSB,Sm,RS, v,RF,v,SS, v, v,RM, v,RL"))]
+  ""
+  "@
+  #
+  s_store_dwordx4\t%1, %A0\;s_waitcnt\tlgkmcnt(0)
+  s_load_dwordx4\t%0, %A1\;s_waitcnt\tlgkmcnt(0)
+  flat_store_dwordx4\t%A0, %1%O0%g0\;s_waitcnt\t0
+  flat_load_dwordx4\t%0, %A1%O1%g1\;s_waitcnt\t0
+  #
+  #
+  #
+  global_store_dwordx4\t%A0, %1%O0%g0\;s_waitcnt\tvmcnt(0)
+  global_load_dwordx4\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0)
+  ds_write_b128\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  ds_read_b128\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)"
+  "reload_completed
+   && REG_P (operands[0])
+   && (REG_P (operands[1]) || GET_CODE (operands[1]) == CONST_INT)"
+  [(set (match_dup 0) (match_dup 1))
+   (set (match_dup 2) (match_dup 3))
+   (set (match_dup 4) (match_dup 5))
+   (set (match_dup 6) (match_dup 7))]
+  {
+    operands[6] = gcn_operand_part (TImode, operands[0], 3);
+    operands[7] = gcn_operand_part (TImode, operands[1], 3);
+    operands[4] = gcn_operand_part (TImode, operands[0], 2);
+    operands[5] = gcn_operand_part (TImode, operands[1], 2);
+    operands[2] = gcn_operand_part (TImode, operands[0], 1);
+    operands[3] = gcn_operand_part (TImode, operands[1], 1);
+    operands[0] = gcn_operand_part (TImode, operands[0], 0);
+    operands[1] = gcn_operand_part (TImode, operands[1], 0);
+  }
+  [(set_attr "type" "mult,smem,smem,flat,flat,vmult,vmult,vmult,flat,flat,\
+		     ds,ds")
+   (set_attr "exec" "*,*,*,single,single,*,*,*,single,single,single,single")
+   (set_attr "length" "*,12,12,12,12,*,*,*,12,12,12,12")])
+
+;; }}}
+;; {{{ Prologue/Epilogue
+
+(define_insn "prologue_use"
+  [(unspec_volatile [(match_operand 0)] UNSPECV_PROLOGUE_USE)]
+  ""
+  ""
+  [(set_attr "length" "0")])
+
+(define_expand "prologue"
+  [(const_int 0)]
+  ""
+  {
+    gcn_expand_prologue ();
+    DONE;
+  })
+
+(define_expand "epilogue"
+  [(const_int 0)]
+  ""
+  {
+    gcn_expand_epilogue ();
+    DONE;
+  })
+
+;; }}}
+;; {{{ Control flow
+
+; This pattern must satisfy simplejump_p, which means it cannot be a parallel
+; that clobbers SCC.  Thus, we must preserve SCC if we're generating a long
+; branch sequence.
+
+(define_insn "jump"
+  [(set (pc)
+	(label_ref (match_operand 0)))]
+  ""
+  {
+    if (get_attr_length (insn) == 4)
+      return "s_branch\t%0";
+    else
+      /* !!! This sequence clobbers EXEC_SAVE_REG and CC_SAVE_REG.  */
+      return "; s_mov_b32\ts22, scc is not supported by the assembler.\;"
+	     ".long\t0xbe9600fd\;"
+	     "s_getpc_b64\ts[20:21]\;"
+	     "s_add_u32\ts20, s20, %0@rel32@lo+4\;"
+	     "s_addc_u32\ts21, s21, %0@rel32@hi+4\;"
+	     "s_cmpk_lg_u32\ts22, 0\;"
+	     "s_setpc_b64\ts[20:21]";
+  }
+  [(set_attr "type" "sopp")
+   (set (attr "length")
+	(if_then_else (and (ge (minus (match_dup 0) (pc))
+			       (const_int -131072))
+			   (lt (minus (match_dup 0) (pc))
+			       (const_int 131072)))
+		      (const_int 4)
+		      (const_int 32)))])
+
+(define_insn "indirect_jump"
+  [(set (pc)
+	(match_operand:DI 0 "register_operand" "Sg"))]
+  ""
+  "s_setpc_b64\t%0"
+  [(set_attr "type" "sop1")
+   (set_attr "length" "4")])
+
+(define_insn "cjump"
+  [(set (pc)
+	(if_then_else
+	  (match_operator:BI 1 "gcn_conditional_operator"
+	    [(match_operand:BI 2 "gcn_conditional_register_operand" " ca")
+	     (const_int 0)])
+	  (label_ref (match_operand 0))
+	  (pc)))
+   (clobber (match_scratch:BI 3					    "=cs"))]
+  ""
+  {
+    if (get_attr_length (insn) == 4)
+      return "s_cbranch%C1\t%0";
+    else
+      {
+	operands[1] = gen_rtx_fmt_ee (reverse_condition
+				       (GET_CODE (operands[1])),
+				      BImode, operands[2], const0_rtx);
+	/* !!! This sequence clobbers EXEC_SAVE_REG and SCC.  */
+	return "s_cbranch%C1\t.skip%=\;"
+	       "s_getpc_b64\ts[20:21]\;"
+	       "s_add_u32\ts20, s20, %0@rel32@lo+4\;"
+	       "s_addc_u32\ts21, s21, %0@rel32@hi+4\;"
+	       "s_setpc_b64\ts[20:21]\n"
+	       ".skip%=:";
+      }
+  }
+  [(set_attr "type" "sopp")
+   (set (attr "length")
+	(if_then_else (and (ge (minus (match_dup 0) (pc))
+			       (const_int -131072))
+			   (lt (minus (match_dup 0) (pc))
+			       (const_int 131072)))
+		      (const_int 4)
+		      (const_int 28)))])
+
+; Returning from a normal function is different to returning from a
+; kernel function.
+
+(define_insn "gcn_return"
+  [(return)]
+  ""
+  {
+    if (cfun && cfun->machine && cfun->machine->normal_function)
+      return "s_setpc_b64\ts[18:19]";
+    else
+      return "s_dcache_wb\;s_endpgm";
+  }
+  [(set_attr "type" "sop1")
+   (set_attr "length" "8")])
+
+(define_expand "call"
+  [(parallel [(call (match_operand 0 "")
+		    (match_operand 1 ""))
+	      (clobber (reg:DI LR_REGNUM))
+	      (clobber (match_scratch:DI 2))])]
+  ""
+  {})
+
+(define_insn "gcn_simple_call"
+  [(call (mem (match_operand 0 "immediate_operand" "Y,B"))
+	 (match_operand 1 "const_int_operand"))
+   (clobber (reg:DI LR_REGNUM))
+   (clobber (match_scratch:DI 2 "=&Sg,X"))]
+  ""
+  "@
+  s_getpc_b64\t%2\;s_add_u32\t%L2, %L2, %0@rel32@lo+4\;s_addc_u32\t%H2, %H2, %0@rel32@hi+4\;s_swappc_b64\ts[18:19], %2
+  s_swappc_b64\ts[18:19], %0"
+  [(set_attr "type" "mult,sop1")
+   (set_attr "length" "24,4")])
+
+(define_insn "movdi_symbol"
+ [(set (match_operand:DI 0 "nonimmediate_operand" "=Sg")
+       (match_operand:DI 1 "general_operand" "Y"))
+  (clobber (reg:BI SCC_REG))]
+ "GET_CODE (operands[1]) == SYMBOL_REF || GET_CODE (operands[1]) == LABEL_REF"
+  {
+    if (SYMBOL_REF_P (operands[1])
+	&& SYMBOL_REF_WEAK (operands[1]))
+	return "s_getpc_b64\t%0\;"
+	       "s_add_u32\t%L0, %L0, %1@gotpcrel32@lo+4\;"
+	       "s_addc_u32\t%H0, %H0, %1@gotpcrel32@hi+4\;"
+	       "s_load_dwordx2\t%0, %0\;"
+	       "s_waitcnt\tlgkmcnt(0)";
+
+    return "s_getpc_b64\t%0\;"
+	   "s_add_u32\t%L0, %L0, %1@rel32@lo+4\;"
+	   "s_addc_u32\t%H0, %H0, %1@rel32@hi+4";
+  }
+ [(set_attr "type" "mult")
+  (set_attr "length" "32")])
+
+(define_insn "gcn_indirect_call"
+  [(call (mem (match_operand:DI 0 "register_operand" "Sg"))
+	 (match_operand 1 "" ""))
+   (clobber (reg:DI LR_REGNUM))
+   (clobber (match_scratch:DI 2 "=X"))]
+  ""
+  "s_swappc_b64\ts[18:19], %0"
+  [(set_attr "type" "sop1")
+   (set_attr "length" "4")])
+
+(define_expand "call_value"
+  [(parallel [(set (match_operand 0 "")
+		   (call (match_operand 1 "")
+			 (match_operand 2 "")))
+	      (clobber (reg:DI LR_REGNUM))
+	      (clobber (match_scratch:DI 3))])]
+  ""
+  {})
+
+(define_insn "gcn_call_value"
+  [(set (match_operand 0 "register_operand" "=Sg,Sg")
+	(call (mem (match_operand 1 "immediate_operand" "Y,B"))
+	      (match_operand 2 "const_int_operand")))
+   (clobber (reg:DI LR_REGNUM))
+   (clobber (match_scratch:DI 3 "=&Sg,X"))]
+  ""
+  "@
+  s_getpc_b64\t%3\;s_add_u32\t%L3, %L3, %1@rel32@lo+4\;s_addc_u32\t%H3, %H3, %1@rel32@hi+4\;s_swappc_b64\ts[18:19], %3
+  s_swappc_b64\ts[18:19], %1"
+  [(set_attr "type" "sop1")
+   (set_attr "length" "24")])
+
+(define_insn "gcn_call_value_indirect"
+  [(set (match_operand 0 "register_operand" "=Sg")
+	(call (mem (match_operand:DI 1 "register_operand" "Sg"))
+	      (match_operand 2 "" "")))
+   (clobber (reg:DI LR_REGNUM))
+   (clobber (match_scratch:DI 3 "=X"))]
+  ""
+  "s_swappc_b64\ts[18:19], %1"
+  [(set_attr "type" "sop1")
+   (set_attr "length" "4")])
+
+; GCN does not have an instruction to clear only part of the instruction
+; cache, so the operands are ignored.
+
+(define_insn "clear_icache"
+  [(unspec_volatile
+    [(match_operand 0 "") (match_operand 1 "")]
+    UNSPECV_ICACHE_INV)]
+  ""
+  "s_icache_inv"
+  [(set_attr "type" "sopp")
+   (set_attr "length" "4")])
+
+;; }}}
+;; {{{ Conditionals
+
+; 32-bit compare, scalar unit only
+
+(define_insn "cstoresi4"
+  [(set (match_operand:BI 0 "gcn_conditional_register_operand"
+							 "=cs, cs, cs, cs")
+	(match_operator:BI 1 "gcn_compare_operator"
+	  [(match_operand:SI 2 "gcn_alu_operand"	 "SSA,SSA,SSB, SS")
+	   (match_operand:SI 3 "gcn_alu_operand"	 "SSA,SSL, SS,SSB")]))]
+  ""
+  "@
+   s_cmp%D1\t%2, %3
+   s_cmpk%D1\t%2, %3
+   s_cmp%D1\t%2, %3
+   s_cmp%D1\t%2, %3"
+  [(set_attr "type" "sopc,sopk,sopk,sopk")
+   (set_attr "length" "4,4,8,8")])
+
+(define_expand "cbranchsi4"
+  [(match_operator 0 "gcn_compare_operator"
+     [(match_operand:SI 1 "gcn_alu_operand")
+      (match_operand:SI 2 "gcn_alu_operand")])
+   (match_operand 3)]
+  ""
+  {
+    rtx cc = gen_reg_rtx (BImode);
+    emit_insn (gen_cstoresi4 (cc, operands[0], operands[1], operands[2]));
+    emit_jump_insn (gen_cjump (operands[3],
+			       gen_rtx_NE (BImode, cc, const0_rtx), cc));
+    DONE;
+  })
+
+; 64-bit compare; either unit
+
+(define_expand "cstoredi4"
+  [(parallel [(set (match_operand:BI 0 "gcn_conditional_register_operand")
+		   (match_operator:BI 1 "gcn_compare_operator"
+		     [(match_operand:DI 2 "gcn_alu_operand")
+		      (match_operand:DI 3 "gcn_alu_operand")]))
+	      (use (match_dup 4))])]
+  ""
+  {
+    operands[4] = gcn_scalar_exec ();
+  })
+
+(define_insn "cstoredi4_vec_and_scalar"
+  [(set (match_operand:BI 0 "gcn_conditional_register_operand" "= cs,  cV")
+	(match_operator:BI 1 "gcn_compare_64bit_operator"
+	  [(match_operand:DI 2 "gcn_alu_operand"	       "%SSA,vSSC")
+	   (match_operand:DI 3 "gcn_alu_operand"	       " SSC,   v")]))
+   (use (match_operand:DI 4 "gcn_exec_operand"		       "   n,   e"))]
+  ""
+  "@
+   s_cmp%D1\t%2, %3
+   v_cmp%E1\tvcc, %2, %3"
+  [(set_attr "type" "sopc,vopc")
+   (set_attr "length" "8")])
+
+(define_insn "cstoredi4_vector"
+  [(set (match_operand:BI 0 "gcn_conditional_register_operand" "= cV")
+	(match_operator:BI 1 "gcn_compare_operator"
+          [(match_operand:DI 2 "gcn_alu_operand"	       "vSSB")
+	   (match_operand:DI 3 "gcn_alu_operand"	       "   v")]))
+   (use (match_operand:DI 4 "gcn_exec_operand"		       "   e"))]
+  ""
+  "v_cmp%E1\tvcc, %2, %3"
+  [(set_attr "type" "vopc")
+   (set_attr "length" "8")])
+
+(define_expand "cbranchdi4"
+  [(match_operator 0 "gcn_compare_operator"
+     [(match_operand:DI 1 "gcn_alu_operand")
+      (match_operand:DI 2 "gcn_alu_operand")])
+   (match_operand 3)]
+  ""
+  {
+    rtx cc = gen_reg_rtx (BImode);
+    emit_insn (gen_cstoredi4 (cc, operands[0], operands[1], operands[2]));
+    emit_jump_insn (gen_cjump (operands[3],
+			       gen_rtx_NE (BImode, cc, const0_rtx), cc));
+    DONE;
+  })
+
+; FP compare; vector unit only
+
+(define_expand "cstore<mode>4"
+  [(parallel [(set (match_operand:BI 0 "gcn_conditional_register_operand")
+		   (match_operator:BI 1 "gcn_fp_compare_operator"
+		     [(match_operand:SFDF 2 "gcn_alu_operand")
+		      (match_operand:SFDF 3 "gcn_alu_operand")]))
+	      (use (match_dup 4))])]
+  ""
+  {
+    operands[4] = gcn_scalar_exec ();
+  })
+
+(define_insn "cstore<mode>4_vec_and_scalar"
+  [(set (match_operand:BI 0 "gcn_conditional_register_operand" "=cV")
+	(match_operator:BI 1 "gcn_fp_compare_operator"
+	  [(match_operand:SFDF 2 "gcn_alu_operand"		"vB")
+	   (match_operand:SFDF 3 "gcn_alu_operand"		 "v")]))
+   (use (match_operand:DI 4 "gcn_exec_operand"			 "e"))]
+  ""
+  "v_cmp%E1\tvcc, %2, %3"
+  [(set_attr "type" "vopc")
+   (set_attr "length" "8")])
+
+(define_expand "cbranch<mode>4"
+  [(match_operator 0 "gcn_fp_compare_operator"
+     [(match_operand:SFDF 1 "gcn_alu_operand")
+      (match_operand:SFDF 2 "gcn_alu_operand")])
+   (match_operand 3)]
+  ""
+  {
+    rtx cc = gen_reg_rtx (BImode);
+    emit_insn (gen_cstore<mode>4 (cc, operands[0], operands[1], operands[2]));
+    emit_jump_insn (gen_cjump (operands[3],
+			       gen_rtx_NE (BImode, cc, const0_rtx), cc));
+    DONE;
+  })
+
+;; }}}
+;; {{{ ALU special cases: Plus
+
+(define_code_iterator plus_minus [plus minus])
+
+(define_predicate "plus_minus_operator"
+  (match_code "plus,minus"))
+
+(define_expand "<expander>si3"
+  [(parallel [(set (match_operand:SI 0 "register_operand")
+		   (plus_minus:SI (match_operand:SI 1 "gcn_alu_operand")
+				  (match_operand:SI 2 "gcn_alu_operand")))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))
+	      (clobber (reg:DI VCC_REG))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+; 32-bit add; pre-reload undecided unit.
+
+(define_insn "*addsi3_vec_and_scalar"
+  [(set (match_operand:SI 0 "register_operand"         "= Sg, Sg, Sg,   v")
+        (plus:SI (match_operand:SI 1 "gcn_alu_operand" "%SgA,  0,SgA,   v")
+		 (match_operand:SI 2 "gcn_alu_operand" " SgA,SgJ,  B,vBSg")))
+   (use (match_operand:DI 3 "gcn_exec_operand"         "   n,  n,  n,   e"))
+   (clobber (reg:BI SCC_REG))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "@
+   s_add_i32\t%0, %1, %2
+   s_addk_i32\t%0, %2
+   s_add_i32\t%0, %1, %2
+   v_add_i32\t%0, %1, %2"
+  [(set_attr "type" "sop2,sopk,sop2,vop2")
+   (set_attr "length" "4,4,8,8")])
+
+; Discard VCC clobber, post reload.
+
+(define_split
+  [(set (match_operand:SIDI 0 "register_operand")
+        (match_operator:SIDI 3 "plus_minus_operator"
+	  [(match_operand:SIDI 1 "gcn_alu_operand")
+	   (match_operand:SIDI 2 "gcn_alu_operand")]))
+   (use (match_operand:DI 4 "" ""))
+   (clobber (reg:BI SCC_REG))
+   (clobber (reg:DI VCC_REG))]
+  "reload_completed && gcn_sdst_register_operand (operands[0], VOIDmode)"
+  [(parallel [(set (match_dup 0)
+		   (match_op_dup 3 [(match_dup 1) (match_dup 2)]))
+	      (clobber (reg:BI SCC_REG))])])
+
+; Discard SCC clobber, post reload.
+; FIXME: do we have an insn for this?
+
+(define_split
+  [(set (match_operand:SIDI 0 "register_operand")
+        (match_operator:SIDI 3 "plus_minus_operator"
+			 [(match_operand:SIDI 1 "gcn_alu_operand")
+		          (match_operand:SIDI 2 "gcn_alu_operand")]))
+   (use (match_operand:DI 4 ""))
+   (clobber (reg:BI SCC_REG))
+   (clobber (reg:DI VCC_REG))]
+  "reload_completed && gcn_vgpr_register_operand (operands[0], VOIDmode)"
+  [(parallel [(set (match_dup 0)
+		   (match_op_dup 3 [(match_dup 1) (match_dup 2)]))
+	      (use (match_dup 4))
+	      (clobber (reg:DI VCC_REG))])])
+
+; 32-bit add, scalar unit.
+
+(define_insn "*addsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"	       "= Sg, Sg, Sg")
+	(plus:SI (match_operand:SI 1 "gcn_alu_operand" "%SgA,  0,SgA")
+		 (match_operand:SI 2 "gcn_alu_operand" " SgA,SgJ,  B")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_add_i32\t%0, %1, %2
+   s_addk_i32\t%0, %2
+   s_add_i32\t%0, %1, %2"
+  [(set_attr "type" "sop2,sopk,sop2")
+   (set_attr "length" "4,4,8")])
+
+; Having this as an insn_and_split allows us to keep DImode adds together
+; through some RTL optimisation passes, and means the CC reg we set isn't
+; dependent on the constraint alternative (which doesn't seem to work well).
+
+; There's an early clobber in the case where "v[0:1]=v[1:2]+?" but
+; "v[0:1]=v[0:1]+?" is fine (as is "v[1:2]=v[0:1]+?", but that's trickier).
+
+; If v_addc_u32 is used to add with carry, a 32-bit literal constant cannot be
+; used as an operand due to the read of VCC, so we restrict constants to the
+; inlinable range for that alternative.
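+
+; A sketch of the post-reload split below: "adddi3 d, a, b" becomes
+;   addsi3_scalar_carry  d.lo, a.lo, b.lo  ; sets the carry (SCC or VCC)
+;   addcsi3_scalar       d.hi, a.hi, b.hi  ; consumes and re-sets it
+; with the carry register chosen by whether the operands are VGPRs.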
+
+(define_insn_and_split "adddi3"
+  [(set (match_operand:DI 0 "register_operand"		
+					      "=&Sg,&Sg,&Sg,&Sg,&v,&v,&v,&v")
+	(plus:DI (match_operand:DI 1 "register_operand" 
+					      "  Sg,  0,  0, Sg, v, 0, 0, v")
+		 (match_operand:DI 2 "nonmemory_operand"
+					      "   0,SgB,  0,SgB, 0,vA, 0,vA")))
+   (clobber (reg:BI SCC_REG))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+  {
+    rtx cc = gen_rtx_REG (BImode, gcn_vgpr_register_operand (operands[1],
+							     DImode)
+			  ? VCC_REG : SCC_REG);
+
+    emit_insn (gen_addsi3_scalar_carry
+	       (gcn_operand_part (DImode, operands[0], 0),
+		gcn_operand_part (DImode, operands[1], 0),
+		gcn_operand_part (DImode, operands[2], 0),
+		cc));
+    rtx val = gcn_operand_part (DImode, operands[2], 1);
+    if (val != const0_rtx)
+      emit_insn (gen_addcsi3_scalar
+		 (gcn_operand_part (DImode, operands[0], 1),
+		  gcn_operand_part (DImode, operands[1], 1),
+		  gcn_operand_part (DImode, operands[2], 1),
+		  cc, cc));
+    else
+      emit_insn (gen_addcsi3_scalar_zero
+		 (gcn_operand_part (DImode, operands[0], 1),
+		  gcn_operand_part (DImode, operands[1], 1),
+		  cc));
+    DONE;
+  }
+  [(set_attr "type" "mult,mult,mult,mult,vmult,vmult,vmult,vmult")
+   (set_attr "length" "8")
+   ; FIXME: These patterns should have (use (exec)) but that messes up
+   ;        the generic splitters, so use single instead
+   (set_attr "exec" "*,*,*,*,single,single,single,single")])
+
+;; Add with carry.
+
+(define_insn "addsi3_scalar_carry"
+  [(set (match_operand:SI 0 "register_operand"	       "= Sg, v")
+	(plus:SI (match_operand:SI 1 "gcn_alu_operand" "%SgA, v")
+		 (match_operand:SI 2 "gcn_alu_operand" " SgB,vB")))
+   (set (match_operand:BI 3 "register_operand"	       "= cs,cV")
+	(ltu:BI (plus:SI (match_dup 1)
+			 (match_dup 2))
+		(match_dup 1)))]
+  ""
+  "@
+   s_add_u32\t%0, %1, %2
+   v_add%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "sop2,vop2")
+   (set_attr "length" "8,8")
+   (set_attr "exec" "*,single")])
+
+(define_insn "addsi3_scalar_carry_cst"
+  [(set (match_operand:SI 0 "register_operand"           "=Sg, v")
+        (plus:SI (match_operand:SI 1 "gcn_alu_operand"   "SgA, v")
+		 (match_operand:SI 2 "const_int_operand" "  n, n")))
+   (set (match_operand:BI 4 "register_operand"           "=cs,cV")
+	(geu:BI (plus:SI (match_dup 1)
+			 (match_dup 2))
+		(match_operand:SI 3 "const_int_operand"  "  n, n")))]
+  "INTVAL (operands[2]) == -INTVAL (operands[3])"
+  "@
+   s_add_u32\t%0, %1, %2
+   v_add%^_u32\t%0, vcc, %2, %1"
+  [(set_attr "type" "sop2,vop2")
+   (set_attr "length" "4")
+   (set_attr "exec" "*,single")])
+
+(define_insn "addcsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"			   "= Sg, v")
+	(plus:SI (plus:SI (zero_extend:SI
+			    (match_operand:BI 3 "register_operand" "= cs,cV"))
+			  (match_operand:SI 1 "gcn_alu_operand"    "%SgA, v"))
+		 (match_operand:SI 2 "gcn_alu_operand"		   " SgB,vA")))
+   (set (match_operand:BI 4 "register_operand"			   "=  3, 3")
+	(ior:BI (ltu:BI (plus:SI
+			  (plus:SI
+			    (zero_extend:SI (match_dup 3))
+			    (match_dup 1))
+			  (match_dup 2))
+			(match_dup 2))
+	        (ltu:BI (plus:SI (zero_extend:SI (match_dup 3)) (match_dup 1))
+		        (match_dup 1))))]
+  ""
+  "@
+   s_addc_u32\t%0, %1, %2
+   v_addc%^_u32\t%0, vcc, %2, %1, vcc"
+  [(set_attr "type" "sop2,vop2")
+   (set_attr "length" "8,4")
+   (set_attr "exec" "*,single")])
+
+(define_insn "addcsi3_scalar_zero"
+  [(set (match_operand:SI 0 "register_operand"		  "=Sg, v")
+        (plus:SI (zero_extend:SI
+		   (match_operand:BI 2 "register_operand" "=cs,cV"))
+		 (match_operand:SI 1 "gcn_alu_operand"    "SgA, v")))
+   (set (match_dup 2)
+	(ltu:BI (plus:SI (zero_extend:SI (match_dup 2))
+			 (match_dup 1))
+		(match_dup 1)))]
+  ""
+  "@
+   s_addc_u32\t%0, %1, 0
+   v_addc%^_u32\t%0, vcc, 0, %1, vcc"
+  [(set_attr "type" "sop2,vop2")
+   (set_attr "length" "4")
+   (set_attr "exec" "*,single")])
+
+; "addptr" is the same as "add" except that it must not write to VCC or SCC
+; as a side-effect.  Unfortunately GCN3 does not have a suitable instruction
+; for this, so we use a split to save and restore the condition code.
+; This pattern must use "Sg" instead of "SD" to prevent the compiler
+; assigning VCC as the destination.
+; FIXME: Provide GCN5 implementation
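+;
+; A sketch of the scalar fallback implemented by the split below:
+;   movbi   <CC_SAVE_REG>, scc   ; save the condition code
+;   adddi3  %0, %1, %2           ; ordinary add, which clobbers SCC
+;   movbi   scc, <CC_SAVE_REG>   ; restore it
+; (the SCC moves go via the movbi byte-encoding workaround above).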
+
+(define_insn_and_split "addptrdi3"
+  [(set (match_operand:DI 0 "register_operand"		 "= Sg,  &v")
+	(plus:DI (match_operand:DI 1 "register_operand"	 "  Sg,  v0")
+		 (match_operand:DI 2 "nonmemory_operand" "SgDB,vDB0")))]
+  ""
+  {
+    if (which_alternative == 0)
+      return "#";
+
+    gcc_assert (!CONST_INT_P (operands[2])
+		   || gcn_inline_constant64_p (operands[2]));
+
+    const char *add_insn = TARGET_GCN3 ? "v_add_u32" : "v_add_co_u32";
+    const char *addc_insn = TARGET_GCN3 ? "v_addc_u32" : "v_addc_co_u32";
+
+    rtx operand2_lo = gcn_operand_part (DImode, operands[2], 0);
+    rtx operand2_hi = gcn_operand_part (DImode, operands[2], 1);
+    rtx new_operands[4] = { operands[0], operands[1], operand2_lo,
+			    gen_rtx_REG (DImode, CC_SAVE_REG) };
+    char buf[100];
+
+    sprintf (buf, "%s %%L0, %%3, %%2, %%L1", add_insn);
+    output_asm_insn (buf, new_operands);
+
+    new_operands[2] = operand2_hi;
+    sprintf (buf, "%s %%H0, %%3, %%2, %%H1, %%3", addc_insn);
+    output_asm_insn (buf, new_operands);
+
+    return "";
+  }
+  "reload_completed
+   && (!gcn_vgpr_register_operand (operands[0], DImode)
+       || (CONST_INT_P (operands[2])
+	   && !gcn_inline_constant64_p (operands[2])))"
+  [(const_int 0)]
+  {
+    rtx cc_reg, cc_save_reg;
+
+    if (gcn_vgpr_register_operand (operands[1], DImode))
+	{
+	  cc_reg = gen_rtx_REG (DImode, VCC_REG);
+	  cc_save_reg = gen_rtx_REG (DImode, CC_SAVE_REG);
+	  emit_insn (gen_movdi (cc_save_reg, cc_reg));
+	}
+    else
+	{
+	  cc_reg = gen_rtx_REG (BImode, SCC_REG);
+	  cc_save_reg = gen_rtx_REG (BImode, CC_SAVE_REG);
+	  emit_insn (gen_movbi (cc_save_reg, cc_reg));
+	}
+
+    emit_insn (gen_adddi3 (operands[0], operands[1], operands[2]));
+
+    if (gcn_vgpr_register_operand (operands[1], DImode))
+	emit_insn (gen_movdi (cc_reg, cc_save_reg));
+    else
+	emit_insn (gen_movbi (cc_reg, cc_save_reg));
+
+    DONE;
+  }
+  [(set_attr "type" "mult,vmult")
+   (set_attr "length" "16")
+   (set_attr "exec" "*,single")])
+
+;; }}}
+;; {{{ ALU special cases: Minus
+
+;; Note that the expand and splitters are shared with add, above.
+;; See "plus_minus".
+
+(define_insn "*subsi3_vec_and_scalar"
+  [(set (match_operand:SI 0 "register_operand"          "=Sg, Sg,    v,   v")
+	(minus:SI (match_operand:SI 1 "gcn_alu_operand" "SgA,SgA,    v,vBSg")
+		  (match_operand:SI 2 "gcn_alu_operand" "SgA,  B, vBSg,   v")))
+   (use (match_operand:DI 3 "gcn_exec_operand"          "  n,  n,    e,   e"))
+   (clobber (reg:BI SCC_REG))
+   (clobber (reg:DI VCC_REG))]
+  ""
+  "@
+   s_sub_i32\t%0, %1, %2
+   s_sub_i32\t%0, %1, %2
+   v_sub_i32\t%0, %1, %2
+   v_sub_i32\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2,vop2,vop2")
+   (set_attr "length" "4,8,8,8")])
+
+(define_insn "*subsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"          "=Sg, Sg")
+        (minus:SI (match_operand:SI 1 "gcn_alu_operand" "SgA,SgA")
+		  (match_operand:SI 2 "gcn_alu_operand" "SgA,  B")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "s_sub_i32\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2")
+   (set_attr "length" "4,8")])
+
+(define_insn_and_split "subdi3"
+  [(set (match_operand:DI 0 "register_operand"        "=Sg, Sg")
+	(minus:DI
+		(match_operand:DI 1 "gcn_alu_operand" "SgA,SgB")
+		(match_operand:DI 2 "gcn_alu_operand" "SgB,SgA")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "#"
+  "reload_completed"
+  [(const_int 0)]
+  {
+    emit_insn (gen_subsi3_scalar_carry
+	       (gcn_operand_part (DImode, operands[0], 0),
+		gcn_operand_part (DImode, operands[1], 0),
+		gcn_operand_part (DImode, operands[2], 0)));
+    rtx val = gcn_operand_part (DImode, operands[2], 1);
+    if (val != const0_rtx)
+      emit_insn (gen_subcsi3_scalar
+		 (gcn_operand_part (DImode, operands[0], 1),
+		  gcn_operand_part (DImode, operands[1], 1),
+		  gcn_operand_part (DImode, operands[2], 1)));
+    else
+      emit_insn (gen_subcsi3_scalar_zero
+		 (gcn_operand_part (DImode, operands[0], 1),
+		  gcn_operand_part (DImode, operands[1], 1)));
+    DONE;
+  }
+  [(set_attr "length" "8")])
+
+(define_insn "subsi3_scalar_carry"
+  [(set (match_operand:SI 0 "register_operand"          "=Sg, Sg")
+        (minus:SI (match_operand:SI 1 "gcn_alu_operand" "SgA,SgB")
+		  (match_operand:SI 2 "gcn_alu_operand" "SgB,SgA")))
+   (set (reg:BI SCC_REG)
+	(gtu:BI (minus:SI (match_dup 1)
+			  (match_dup 2))
+		(match_dup 1)))]
+  ""
+  "s_sub_u32\t%0, %1, %2"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "8")])
+
+(define_insn "subsi3_scalar_carry_cst"
+  [(set (match_operand:SI 0 "register_operand"           "=Sg")
+        (minus:SI (match_operand:SI 1 "gcn_alu_operand"  "SgA")
+		 (match_operand:SI 2 "const_int_operand" "  n")))
+   (set (reg:BI SCC_REG)
+	(leu:BI (minus:SI (match_dup 1)
+			 (match_dup 2))
+		(match_operand:SI 3 "const_int_operand"  "  n")))]
+  "INTVAL (operands[2]) == -INTVAL (operands[3])"
+  "s_sub_u32\t%0, %1, %2"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "4")])
+
+(define_insn "subcsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"                    "=Sg, Sg")
+        (minus:SI (minus:SI (zero_extend:SI (reg:BI SCC_REG))
+			    (match_operand:SI 1 "gcn_alu_operand" "SgA,SgB"))
+		 (match_operand:SI 2 "gcn_alu_operand"            "SgB,SgA")))
+   (set (reg:BI SCC_REG)
+	(ior:BI (gtu:BI (minus:SI (minus:SI (zero_extend:SI (reg:BI SCC_REG))
+					    (match_dup 1))
+				 (match_dup 2))
+			(match_dup 1))
+	        (gtu:BI (minus:SI (zero_extend:SI (reg:BI SCC_REG))
+				  (match_dup 1))
+		        (match_dup 1))))]
+  ""
+  "s_subb_u32\t%0, %1, %2"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "8")])
+
+(define_insn "subcsi3_scalar_zero"
+  [(set (match_operand:SI 0 "register_operand"		"=Sg")
+        (minus:SI (zero_extend:SI (reg:BI SCC_REG))
+		  (match_operand:SI 1 "gcn_alu_operand" "SgA")))
+   (set (reg:BI SCC_REG)
+	(gtu:BI (minus:SI (zero_extend:SI (reg:BI SCC_REG)) (match_dup 1))
+		(match_dup 1)))]
+  ""
+  "s_subb_u32\t%0, %1, 0"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "4")])
+
+;; }}}
+;; {{{ ALU: mult
+
+(define_expand "mulsi3"
+  [(set (match_operand:SI 0 "register_operand")
+        (mult:SI (match_operand:SI 1 "gcn_alu_operand")
+		 (match_operand:SI 2 "gcn_alu_operand")))
+   (use (match_dup 3))]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+; Vector multiply has a vop3a encoding, but no corresponding vop2 form,
+; so there is no long immediate.
+(define_insn_and_split "*mulsi3_vec_and_scalar"
+  [(set (match_operand:SI 0 "register_operand"	       "= Sg,Sg, Sg,   v")
+        (mult:SI (match_operand:SI 1 "gcn_alu_operand" "%SgA, 0,SgA,   v")
+		 (match_operand:SI 2 "gcn_alu_operand" " SgA, J,  B,vASg")))
+   (use (match_operand:DI 3 "gcn_exec_operand"         "   n, n,  n,   e"))]
+  ""
+  "@
+   #
+   #
+   #
+   v_mul_lo_i32\t%0, %1, %2"
+  "reload_completed && gcn_sdst_register_operand (operands[0], VOIDmode)"
+   [(set (match_dup 0)
+	 (mult:SI (match_dup 1)
+		  (match_dup 2)))]
+  {}
+  [(set_attr "type" "sop2,sopk,sop2,vop3a")
+   (set_attr "length" "4,4,8,4")])
+
+(define_insn "*mulsi3_scalar"
+  [(set (match_operand:SI 0 "register_operand"	       "= Sg,Sg, Sg")
+	(mult:SI (match_operand:SI 1 "gcn_alu_operand" "%SgA, 0,SgA")
+		 (match_operand:SI 2 "gcn_alu_operand" " SgA, J,  B")))]
+  ""
+  "@
+   s_mul_i32\t%0, %1, %2
+   s_mulk_i32\t%0, %2
+   s_mul_i32\t%0, %1, %2"
+  [(set_attr "type" "sop2,sopk,sop2")
+   (set_attr "length" "4,4,8")])
+
+;; }}}
+;; {{{ ALU: generic 32-bit unop
+
+(define_code_iterator vec_and_scalar_unop [not popcount])
+
+; The (use (const_int 0)) serves as a marker to differentiate these unop
+; patterns from the binop patterns in the generic splitters below.
+(define_expand "<expander>si2"
+  [(parallel [(set (match_operand:SI 0 "register_operand")
+	           (vec_and_scalar_unop:SI
+		     (match_operand:SI 1 "gcn_alu_operand")))
+	      (use (match_dup 2))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])]
+  ""
+  {
+    operands[2] = gcn_scalar_exec ();
+    operands[3] = const0_rtx;
+  })
+
+(define_insn "*<expander>si2"
+  [(set (match_operand:SI 0 "register_operand"  "=Sg,   v")
+        (vec_and_scalar_unop:SI
+	  (match_operand:SI 1 "gcn_alu_operand" "SgB,vSgB")))
+   (use (match_operand:DI 2 "gcn_exec_operand"  "  n,   e"))
+   (use (const_int 0))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<s_mnemonic>0\t%0, %1
+   v_<s_mnemonic>0\t%0, %1"
+  [(set_attr "type" "sop1,vop1")
+   (set_attr "length" "8")])
+
+(define_insn "*<expander>si2_scalar"
+  [(set (match_operand:SI 0 "register_operand"			      "=Sg")
+        (vec_and_scalar_unop:SI (match_operand:SI 1 "gcn_alu_operand" "SgB")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "s_<s_mnemonic>0\t%0, %1"
+  [(set_attr "type" "sop1")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ ALU: generic 32-bit binop
+
+(define_code_iterator vec_and_scalar [and ior xor ashift lshiftrt
+				      ashiftrt smin smax umin umax])
+
+(define_expand "<expander>si3"
+  [(parallel [(set (match_operand:SI 0 "register_operand")
+	           (vec_and_scalar:SI
+		     (match_operand:SI 1 "gcn_alu_operand")
+		     (match_operand:SI 2 "gcn_alu_operand")))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+; No plus and mult - they have variants with a 16-bit immediate
+; and thus are defined separately.
+(define_code_iterator vec_and_scalar_com [and ior xor smin smax umin umax])
+(define_code_iterator vec_and_scalar_nocom [ashift lshiftrt ashiftrt])
+
+(define_insn "*<expander>si3"
+  [(set (match_operand:SI 0 "register_operand"  "= Sg,   v")
+        (vec_and_scalar_com:SI
+	  (match_operand:SI 1 "gcn_alu_operand" "%SgA,   v")
+	  (match_operand:SI 2 "gcn_alu_operand" " SgB,vSgB")))
+   (use (match_operand:DI 3 "gcn_exec_operand"  "   n,   e"))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<mnemonic>0\t%0, %1, %2
+   v_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2,vop2")
+   (set_attr "length" "8")])
+
+(define_insn "*<expander>si3_scalar"
+  [(set (match_operand:SI 0 "register_operand"   "= Sg")
+        (vec_and_scalar_com:SI
+	  (match_operand:SI 1 "register_operand" "%SgA")
+	  (match_operand:SI 2 "gcn_alu_operand"  " SgB")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "s_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "8")])
+
+; We expect this to be split post-reload, to remove the dependency on the
+; EXEC register in the scalar case.
+
+(define_insn "*<expander>si3_vec_and_scalar"
+  [(set (match_operand:SI 0 "register_operand"	 "=Sg, Sg,   v")
+        (vec_and_scalar_nocom:SI
+	  (match_operand:SI 1 "gcn_alu_operand"  "SgB,SgA,   v")
+	  (match_operand:SI 2 "gcn_alu_operand"  "SgA,SgB,vSgB")))
+     (use (match_operand:DI 3 "gcn_exec_operand" "  n,  n,   e"))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<mnemonic>0\t%0, %1, %2
+   s_<mnemonic>0\t%0, %1, %2
+   v_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2,vop2")
+   (set_attr "length" "8")])
+
+(define_insn "<expander>si3_scalar"
+  [(set (match_operand:SI 0 "register_operand"  "=Sg,Sg")
+        (vec_and_scalar_nocom:SI
+	  (match_operand:SI 1 "gcn_alu_operand" "SgB,SgA")
+	  (match_operand:SI 2 "gcn_alu_operand" "SgA,SgB")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<mnemonic>0\t%0, %1, %2
+   s_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ ALU: generic 64-bit
+
+(define_code_iterator vec_and_scalar64_com [and ior xor])
+
+(define_expand "<expander>di3"
+  [(parallel [(set (match_operand:DI 0 "register_operand")
+		    (vec_and_scalar64_com:DI
+			(match_operand:DI 1 "gcn_alu_operand")
+			(match_operand:DI 2 "gcn_alu_operand")))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+(define_insn_and_split "*<expander>di3_vec_and_scalar"
+   [(set (match_operand:DI 0 "register_operand"   "= Sg,  &v,  &v")
+	 (vec_and_scalar64_com:DI
+	  (match_operand:DI 1 "gcn_alu_operand"   "%SgA,   v,   0")
+	   (match_operand:DI 2 "gcn_alu_operand"  " SgC,vSgB,vSgB")))
+      (use (match_operand:DI 3 "gcn_exec_operand" "   n,   e,   e"))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<mnemonic>0\t%0, %1, %2
+   #
+   #"
+  "reload_completed && gcn_vgpr_register_operand (operands[0], DImode)"
+  [(parallel [(set (match_dup 4)
+		   (vec_and_scalar64_com:SI (match_dup 5) (match_dup 6)))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])
+   (parallel [(set (match_dup 7)
+		   (vec_and_scalar64_com:SI (match_dup 8) (match_dup 9)))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])]
+  {
+    operands[4] = gcn_operand_part (DImode, operands[0], 0);
+    operands[5] = gcn_operand_part (DImode, operands[1], 0);
+    operands[6] = gcn_operand_part (DImode, operands[2], 0);
+    operands[7] = gcn_operand_part (DImode, operands[0], 1);
+    operands[8] = gcn_operand_part (DImode, operands[1], 1);
+    operands[9] = gcn_operand_part (DImode, operands[2], 1);
+  }
+  [(set_attr "type" "sop2,vop2,vop2")
+   (set_attr "length" "8")])
+
+(define_insn "*<expander>di3_scalar"
+  [(set (match_operand:DI 0 "register_operand"  "= Sg")
+        (vec_and_scalar64_com:DI
+	  (match_operand:DI 1 "gcn_alu_operand" "%SgA")
+	  (match_operand:DI 2 "gcn_alu_operand" " SgC")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "s_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2")
+   (set_attr "length" "8")])
+
+(define_expand "<expander>di3"
+  [(parallel [(set (match_operand:DI 0 "register_operand")
+		   (vec_and_scalar_nocom:DI
+		     (match_operand:DI 1 "gcn_alu_operand")
+		     (match_operand:SI 2 "gcn_alu_operand")))
+	      (use (match_dup 3))
+	      (clobber (reg:BI SCC_REG))])]
+  ""
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+(define_insn "*<expander>di3_vec_and_scalar"
+  [(set (match_operand:DI 0 "register_operand"   "=Sg, Sg,   v")
+	(vec_and_scalar_nocom:DI
+	  (match_operand:DI 1 "gcn_alu_operand"  "SgC,SgA,   v")
+	  (match_operand:SI 2 "gcn_alu_operand"  "SgA,SgC,vSgC")))
+     (use (match_operand:DI 3 "gcn_exec_operand" "  n,  n,   e"))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "@
+   s_<mnemonic>0\t%0, %1, %2
+   s_<mnemonic>0\t%0, %1, %2
+   v_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2,vop2")
+   (set_attr "length" "8")])
+
+(define_insn "*<expander>di3_scalar"
+  [(set (match_operand:DI 0 "register_operand"  "=Sg, Sg")
+        (vec_and_scalar_nocom:DI
+	  (match_operand:DI 1 "gcn_alu_operand" "SgC,SgA")
+	  (match_operand:SI 2 "gcn_alu_operand" "SgA,SgC")))
+   (clobber (reg:BI SCC_REG))]
+  ""
+  "s_<mnemonic>0\t%0, %1, %2"
+  [(set_attr "type" "sop2,sop2")
+   (set_attr "length" "8")])
+
+;; }}}
+;; {{{ Generic splitters
+
+;; These choose the proper insn variant once we've decided whether to use
+;; the vector or the scalar ALU.
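+;;
+;; For instance (a sketch): a unop parallel carrying both (use EXEC) and
+;; the (use (const_int 0)) marker is rewritten post-reload to drop the
+;; EXEC use when the destination is an SGPR, or to drop the marker and
+;; keep the EXEC use when it is a VGPR.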
+
+; Discard (use EXEC) from scalar unops.
+
+(define_split
+  [(set (match_operand 0 "gcn_sdst_register_operand")
+        (match_operator 3 "unary_operator"
+	  [(match_operand 1 "gcn_alu_operand")]))
+   (use (match_operand:DI 2 ""))
+   (use (const_int 0))]
+  "reload_completed"
+  [(set (match_dup 0) (match_op_dup 3 [(match_dup 1)]))])
+
+; Discard const0 from valu unops.
+
+(define_split
+  [(set (match_operand 0 "gcn_vgpr_register_operand")
+        (match_operator 3 "unary_operator"
+	  [(match_operand 1 "gcn_alu_operand")]))
+   (use (match_operand:DI 2 ""))
+   (use (const_int 0))]
+  "reload_completed"
+  [(parallel [(set (match_dup 0)
+		   (match_op_dup 3 [(match_dup 1)]))
+              (use (match_dup 2))])])
+
+; Discard (use EXEC) from scalar binops.
+
+(define_split
+  [(set (match_operand 0 "gcn_sdst_register_operand")
+        (match_operator 4 "binary_operator"
+	  [(match_operand 1 "gcn_alu_operand")
+	   (match_operand 2 "gcn_alu_operand")]))
+   (use (match_operand:DI 3 ""))
+   (clobber (reg:BI SCC_REG))]
+  "reload_completed"
+  [(parallel [(set (match_dup 0)
+		   (match_op_dup 4 [(match_dup 1) (match_dup 2)]))
+              (clobber (reg:BI SCC_REG))])])
+
+; Discard (clobber SCC) from valu binops.
+
+(define_split
+  [(set (match_operand 0 "gcn_vgpr_register_operand")
+        (match_operator 4 "binary_operator"
+	  [(match_operand 1 "gcn_alu_operand")
+	   (match_operand 2 "gcn_alu_operand")]))
+   (use (match_operand:DI 3 ""))
+   (clobber (reg:BI SCC_REG))]
+  "reload_completed"
+  [(parallel [(set (match_dup 0)
+		   (match_op_dup 4 [(match_dup 1) (match_dup 2)]))
+              (use (match_dup 3))])])
+
+;; }}}
+;; {{{ Atomics
+
+; Each compute unit has its own L1 cache.  The L2 cache is shared among
+; all the compute units.  Any load or store instruction can skip L1 and
+; access L2 directly using the "glc" flag.  Atomic instructions also skip
+; L1.  The L1 cache can be flushed and invalidated using instructions.
+;
+; Therefore, in order for "acquire" and "release" atomic modes to work
+; correctly across compute units we must flush before each "release"
+; and invalidate the cache after each "acquire".  It might seem like
+; invalidation could be safely done before an "acquire", but since each
+; compute unit can run up to 40 threads simultaneously, all reading values
+; into the L1 cache, this is not actually safe.
+;
+; Additionally, scalar flat instructions access L2 via a different cache
+; (the "constant cache"), so they have separate constrol instructions.  We
+; do not attempt to invalidate both caches at once; instead, atomics
+; operating on scalar flat pointers will flush the constant cache, and
+; atomics operating on flat or global pointers will flush L1.  It is up to
+; the programmer to get this right.
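+;
+; A hedged sketch of the intended sequences (the disabled patterns below
+; do not implement this yet; see the FIXMEs):
+;   release:  buffer_wbinvl1_vol          ; write-back/invalidate L1
+;             flat_store_dword ... glc    ; store straight to L2
+;   acquire:  flat_load_dword ... glc     ; load straight from L2
+;             buffer_wbinvl1_vol          ; invalidate L1 afterwards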
+
+(define_code_iterator atomicops [plus minus and ior xor])
+(define_mode_attr X [(SI "") (DI "_X2")])
+
+;; TODO compare_and_swap test_and_set inc dec
+;; Hardware also supports min and max, but GCC does not.
+
+(define_expand "memory_barrier"
+  [(set (match_dup 0)
+	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
+  ""
+  {
+    operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
+    MEM_VOLATILE_P (operands[0]) = 1;
+  })
+
+(define_insn "*memory_barrier"
+  [(set (match_operand:BLK 0)
+	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
+  ""
+  "buffer_wbinvl1_vol"
+  [(set_attr "type" "mubuf")
+   (set_attr "length" "4")])
+
+; FIXME: These patterns have been disabled as they do not seem to work
+; reliably - they can cause hangs or incorrect results.
+; TODO: flush caches according to memory model
+(define_expand "atomic_fetch_<bare_mnemonic><mode>"
+  [(parallel [(set (match_operand:SIDI 0 "register_operand")
+		   (match_operand:SIDI 1 "memory_operand"))
+	      (set (match_dup 1)
+		   (unspec_volatile:SIDI
+		     [(atomicops:SIDI
+		       (match_dup 1)
+		       (match_operand:SIDI 2 "register_operand"))]
+		     UNSPECV_ATOMIC))
+	      (use (match_operand 3 "const_int_operand"))
+	      (use (match_dup 4))])]
+  "0 /* Disabled.  */"
+  {
+    operands[4] = gcn_scalar_exec ();
+  })
+
+(define_insn "*atomic_fetch_<bare_mnemonic><mode>_insn"
+  [(set (match_operand:SIDI 0 "register_operand"     "=Sm, v, v")
+	(match_operand:SIDI 1 "memory_operand"	     "+RS,RF,RM"))
+   (set (match_dup 1)
+	(unspec_volatile:SIDI
+	  [(atomicops:SIDI
+	    (match_dup 1)
+	    (match_operand:SIDI 2 "register_operand" " Sm, v, v"))]
+	   UNSPECV_ATOMIC))
+   (use (match_operand 3 "const_int_operand"))
+   (use (match_operand:DI 4 "gcn_exec_operand"       "  n, e, e"))]
+  "0 /* Disabled.  */"
+  "@
+   s_atomic_<bare_mnemonic><X>\t%0, %1, %2 glc\;s_waitcnt\tlgkmcnt(0)
+   flat_atomic_<bare_mnemonic><X>\t%0, %1, %2 glc\;s_waitcnt\t0
+   global_atomic_<bare_mnemonic><X>\t%0, %A1, %2%O1 glc\;s_waitcnt\tvmcnt(0)"
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "12")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+; FIXME: These patterns are disabled because the instructions don't
+; seem to work as advertised.  Specifically, OMP "team distribute"
+; reductions apparently "lose" some of the writes, similar to what
+; you might expect from a concurrent non-atomic read-modify-write.
+; TODO: flush caches according to memory model
+
+(define_expand "atomic_<bare_mnemonic><mode>"
+  [(parallel [(set (match_operand:SIDI 0 "memory_operand")
+		   (unspec_volatile:SIDI
+		     [(atomicops:SIDI
+		       (match_dup 0)
+		       (match_operand:SIDI 1 "register_operand"))]
+		    UNSPECV_ATOMIC))
+	      (use (match_operand 2 "const_int_operand"))
+	      (use (match_dup 3))])]
+  "0 /* Disabled.  */"
+  {
+    operands[3] = gcn_scalar_exec ();
+  })
+
+(define_insn "*atomic_<bare_mnemonic><mode>_insn"
+  [(set (match_operand:SIDI 0 "memory_operand"       "+RS,RF,RM")
+	(unspec_volatile:SIDI
+	  [(atomicops:SIDI
+	    (match_dup 0)
+	    (match_operand:SIDI 1 "register_operand" " Sm, v, v"))]
+	  UNSPECV_ATOMIC))
+   (use (match_operand 2 "const_int_operand"))
+   (use (match_operand:DI 3 "gcn_exec_operand"       "  n, e, e"))]
+  "0 /* Disabled.  */"
+  "@
+   s_atomic_<bare_mnemonic><X>\t%0, %1\;s_waitcnt\tlgkmcnt(0)
+   flat_atomic_<bare_mnemonic><X>\t%0, %1\;s_waitcnt\t0
+   global_atomic_<bare_mnemonic><X>\t%A0, %1%O0\;s_waitcnt\tvmcnt(0)"
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "12")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+(define_mode_attr x2 [(SI "DI") (DI "TI")])
+(define_mode_attr size [(SI "4") (DI "8")])
+(define_mode_attr bitsize [(SI "32") (DI "64")])
+
+(define_expand "sync_compare_and_swap<mode>"
+  [(match_operand:SIDI 0 "register_operand")
+   (match_operand:SIDI 1 "memory_operand")
+   (match_operand:SIDI 2 "register_operand")
+   (match_operand:SIDI 3 "register_operand")]
+  ""
+  {
+    if (MEM_ADDR_SPACE (operands[1]) == ADDR_SPACE_LDS)
+      {
+	rtx exec = gcn_scalar_exec ();
+	emit_insn (gen_sync_compare_and_swap<mode>_lds_insn (operands[0],
+							     operands[1],
+							     operands[2],
+							     operands[3],
+							     exec));
+	DONE;
+      }
+
+    /* Operands 2 and 3 must be placed in consecutive registers, and passed
+       as a combined value.  */
+    rtx src_cmp = gen_reg_rtx (<x2>mode);
+    emit_move_insn (gen_rtx_SUBREG (<MODE>mode, src_cmp, 0), operands[3]);
+    emit_move_insn (gen_rtx_SUBREG (<MODE>mode, src_cmp, <size>), operands[2]);
+    emit_insn (gen_sync_compare_and_swap<mode>_insn (operands[0],
+						     operands[1],
+						     src_cmp,
+						     gcn_scalar_exec ()));
+    DONE;
+  })
+
+(define_insn "sync_compare_and_swap<mode>_insn"
+  [(set (match_operand:SIDI 0 "register_operand"    "=Sm, v, v")
+	(match_operand:SIDI 1 "memory_operand"      "+RS,RF,RM"))
+   (set (match_dup 1)
+	(unspec_volatile:SIDI
+	  [(match_operand:<x2> 2 "register_operand" " Sm, v, v")]
+	  UNSPECV_ATOMIC))
+   (use (match_operand:DI 3 "gcn_exec_operand"      "  n, e, e"))]
+  ""
+  "@
+   s_atomic_cmpswap<X>\t%0, %1, %2 glc\;s_waitcnt\tlgkmcnt(0)
+   flat_atomic_cmpswap<X>\t%0, %1, %2 glc\;s_waitcnt\t0
+   global_atomic_cmpswap<X>\t%0, %A1, %2%O1 glc\;s_waitcnt\tvmcnt(0)"
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "12")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+(define_insn "sync_compare_and_swap<mode>_lds_insn"
+  [(set (match_operand:SIDI 0 "register_operand"    "= v")
+	(unspec_volatile:SIDI
+	  [(match_operand:SIDI 1 "memory_operand"   "+RL")]
+	  UNSPECV_ATOMIC))
+   (set (match_dup 1)
+	(unspec_volatile:SIDI
+	  [(match_operand:SIDI 2 "register_operand" "  v")
+	   (match_operand:SIDI 3 "register_operand" "  v")]
+	  UNSPECV_ATOMIC))
+   (use (match_operand:DI 4 "gcn_exec_operand"      "  e"))]
+  ""
+  "ds_cmpst_rtn_b<bitsize> %0, %1, %2, %3\;s_waitcnt\tlgkmcnt(0)"
+  [(set_attr "type" "ds")
+   (set_attr "length" "12")])
+
+(define_expand "atomic_load<mode>"
+  [(match_operand:SIDI 0 "register_operand")
+   (match_operand:SIDI 1 "memory_operand")
+   (match_operand 2 "immediate_operand")]
+  ""
+  {
+    emit_insn (gen_atomic_load<mode>_insn (operands[0], operands[1],
+					   operands[2], gcn_scalar_exec ()));
+    DONE;
+  })
+
+(define_insn "atomic_load<mode>_insn"
+  [(set (match_operand:SIDI 0 "register_operand"  "=Sm, v, v")
+	(unspec_volatile:SIDI
+	  [(match_operand:SIDI 1 "memory_operand" " RS,RF,RM")]
+	  UNSPECV_ATOMIC))
+   (use (match_operand:SIDI 2 "immediate_operand" "  i, i, i"))
+   (use (match_operand:DI 3 "gcn_exec_operand"    "  n, e, e"))]
+  ""
+  {
+    switch (INTVAL (operands[2]))
+      {
+      case MEMMODEL_RELAXED:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_load%o0\t%0, %A1 glc\;s_waitcnt\tlgkmcnt(0)";
+	  case 1:
+	    return "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0";
+	  case 2:
+	    return "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)";
+	  }
+	break;
+      case MEMMODEL_CONSUME:
+      case MEMMODEL_ACQUIRE:
+      case MEMMODEL_SYNC_ACQUIRE:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_load%o0\t%0, %A1 glc\;s_waitcnt\tlgkmcnt(0)\;"
+	           "s_dcache_wb_vol";
+	  case 1:
+	    return "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
+	           "buffer_wbinvl1_vol";
+	  case 2:
+	    return "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
+	           "buffer_wbinvl1_vol";
+	  }
+	break;
+      case MEMMODEL_ACQ_REL:
+      case MEMMODEL_SEQ_CST:
+      case MEMMODEL_SYNC_SEQ_CST:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_dcache_wb_vol\;s_load%o0\t%0, %A1 glc\;"
+	           "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
+	  case 1:
+	    return "buffer_wbinvl1_vol\;flat_load%o0\t%0, %A1%O1 glc\;"
+	           "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	  case 2:
+	    return "buffer_wbinvl1_vol\;global_load%o0\t%0, %A1%O1 glc\;"
+	           "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	  }
+	break;
+      }
+    gcc_unreachable ();
+  }
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "20")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+(define_expand "atomic_store<mode>"
+  [(match_operand:SIDI 0 "memory_operand")
+   (match_operand:SIDI 1 "register_operand")
+   (match_operand 2 "immediate_operand")]
+  ""
+  {
+    emit_insn (gen_atomic_store<mode>_insn (operands[0], operands[1],
+					    operands[2], gcn_scalar_exec ()));
+    DONE;
+  })
+
+(define_insn "atomic_store<mode>_insn"
+  [(set (match_operand:SIDI 0 "memory_operand"      "=RS,RF,RM")
+	(unspec_volatile:SIDI
+	  [(match_operand:SIDI 1 "register_operand" " Sm, v, v")]
+	  UNSPECV_ATOMIC))
+  (use (match_operand:SIDI 2 "immediate_operand"    "  i, i, i"))
+  (use (match_operand:DI 3 "gcn_exec_operand"       "  n, e, e"))]
+  ""
+  {
+    switch (INTVAL (operands[2]))
+      {
+      case MEMMODEL_RELAXED:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_store%o1\t%1, %A0 glc\;s_waitcnt\tlgkmcnt(0)";
+	  case 1:
+	    return "flat_store%o1\t%A0, %1%O0 glc\;s_waitcnt\t0";
+	  case 2:
+	    return "global_store%o1\t%A0, %1%O0 glc\;s_waitcnt\tvmcnt(0)";
+	  }
+	break;
+      case MEMMODEL_RELEASE:
+      case MEMMODEL_SYNC_RELEASE:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_dcache_wb_vol\;s_store%o1\t%1, %A0 glc\;"
+		   "s_waitcnt\tlgkmcnt(0)";
+	  case 1:
+	    return "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc\;"
+		   "s_waitcnt\t0";
+	  case 2:
+	    return "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc\;"
+	           "s_waitcnt\tvmcnt(0)";
+	  }
+	break;
+      case MEMMODEL_ACQ_REL:
+      case MEMMODEL_SEQ_CST:
+      case MEMMODEL_SYNC_SEQ_CST:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_dcache_wb_vol\;s_store%o1\t%1, %A0 glc\;"
+		   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
+	  case 1:
+	    return "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc\;"
+		   "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	  case 2:
+	    return "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc\;"
+		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	  }
+	break;
+      }
+    gcc_unreachable ();
+  }
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "20")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+(define_expand "atomic_exchange<mode>"
+  [(match_operand:SIDI 0 "register_operand")
+   (match_operand:SIDI 1 "memory_operand")
+   (match_operand:SIDI 2 "register_operand")
+   (match_operand 3 "immediate_operand")]
+  ""
+  {
+    emit_insn (gen_atomic_exchange<mode>_insn (operands[0], operands[1],
+					       operands[2], operands[3],
+					       gcn_scalar_exec ()));
+    DONE;
+  })
+
+(define_insn "atomic_exchange<mode>_insn"
+  [(set (match_operand:SIDI 0 "register_operand"    "=Sm, v, v")
+        (match_operand:SIDI 1 "memory_operand"	    "+RS,RF,RM"))
+   (set (match_dup 1)
+	(unspec_volatile:SIDI
+	  [(match_operand:SIDI 2 "register_operand" " Sm, v, v")]
+	  UNSPECV_ATOMIC))
+   (use (match_operand 3 "immediate_operand"))
+   (use (match_operand:DI 4 "gcn_exec_operand"	    "  n, e, e"))]
+  ""
+  {
+    switch (INTVAL (operands[3]))
+      {
+      case MEMMODEL_RELAXED:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\tlgkmcnt(0)";
+	  case 1:
+	    return "flat_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\t0";
+	  case 2:
+	    return "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		   "s_waitcnt\tvmcnt(0)";
+	  }
+	break;
+      case MEMMODEL_CONSUME:
+      case MEMMODEL_ACQUIRE:
+      case MEMMODEL_SYNC_ACQUIRE:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\tlgkmcnt(0)\;"
+		   "s_dcache_wb_vol\;s_dcache_inv_vol";
+	  case 1:
+	    return "flat_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\t0\;"
+		   "buffer_wbinvl1_vol";
+	  case 2:
+	    return "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	  }
+	break;
+      case MEMMODEL_RELEASE:
+      case MEMMODEL_SYNC_RELEASE:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_dcache_wb_vol\;s_atomic_swap<X>\t%0, %1, %2 glc\;"
+		   "s_waitcnt\tlgkmcnt(0)";
+	  case 1:
+	    return "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		   "s_waitcnt\t0";
+	  case 2:
+	    return "buffer_wbinvl1_vol\;"
+		   "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		   "s_waitcnt\tvmcnt(0)";
+	  }
+	break;
+      case MEMMODEL_ACQ_REL:
+      case MEMMODEL_SEQ_CST:
+      case MEMMODEL_SYNC_SEQ_CST:
+	switch (which_alternative)
+	  {
+	  case 0:
+	    return "s_dcache_wb_vol\;s_atomic_swap<X>\t%0, %1, %2 glc\;"
+		   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
+	  case 1:
+	    return "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		   "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	  case 2:
+	    return "buffer_wbinvl1_vol\;"
+		   "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	  }
+	break;
+      }
+    gcc_unreachable ();
+  }
+  [(set_attr "type" "smem,flat,flat")
+   (set_attr "length" "20")
+   (set_attr "gcn_version" "gcn5,*,gcn5")])
+
+;; }}}
+;; {{{ OpenACC / OpenMP
+
+(define_expand "oacc_dim_size"
+  [(match_operand:SI 0 "register_operand")
+   (match_operand:SI 1 "const_int_operand")]
+  ""
+  {
+    rtx tmp = gcn_oacc_dim_size (INTVAL (operands[1]));
+    emit_move_insn (operands[0], gen_lowpart (SImode, tmp));
+    DONE;
+  })
+
+(define_expand "oacc_dim_pos"
+  [(match_operand:SI 0 "register_operand")
+   (match_operand:SI 1 "const_int_operand")]
+  ""
+  {
+    emit_move_insn (operands[0], gcn_oacc_dim_pos (INTVAL (operands[1])));
+    DONE;
+  })
+
+(define_expand "gcn_wavefront_barrier"
+  [(set (match_dup 0)
+	(unspec_volatile:BLK [(match_dup 0)] UNSPECV_BARRIER))]
+  ""
+  {
+    operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
+    MEM_VOLATILE_P (operands[0]) = 1;
+  })
+
+(define_insn "*gcn_wavefront_barrier"
+  [(set (match_operand:BLK 0 "")
+	(unspec_volatile:BLK [(match_dup 0)] UNSPECV_BARRIER))]
+  ""
+  "s_barrier"
+  [(set_attr "type" "sopp")])
+
+(define_expand "oacc_fork"
+  [(set (match_operand:SI 0 "")
+	(match_operand:SI 1 ""))
+   (use (match_operand:SI 2 ""))]
+  ""
+  {
+    /* We need to have oacc_fork/oacc_join named patterns as a pair,
+       but the fork isn't actually used.  */
+    gcc_unreachable ();
+  })
+
+(define_expand "oacc_join"
+  [(set (match_operand:SI 0 "")
+	(match_operand:SI 1 ""))
+   (use (match_operand:SI 2 ""))]
+  ""
+  {
+    emit_insn (gen_gcn_wavefront_barrier ());
+    DONE;
+  })
+
+;; }}}
+
+(include "gcn-valu.md")
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
new file mode 100644
index 0000000..023c940
--- /dev/null
+++ b/gcc/config/gcn/gcn.opt
@@ -0,0 +1,78 @@
+; Options for the GCN port of the compiler.
+
+; Copyright (C) 2016-2018 Free Software Foundation, Inc.
+;
+; This file is part of GCC.
+;
+; GCC is free software; you can redistribute it and/or modify it under
+; the terms of the GNU General Public License as published by the Free
+; Software Foundation; either version 3, or (at your option) any later
+; version.
+;
+; GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+; WARRANTY; without even the implied warranty of MERCHANTABILITY or
+; FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+; for more details.
+;
+; You should have received a copy of the GNU General Public License
+; along with GCC; see the file COPYING3.  If not see
+; <http://www.gnu.org/licenses/>.
+
+HeaderInclude
+config/gcn/gcn-opts.h
+
+Enum
+Name(gpu_type) Type(enum processor_type)
+GCN GPU type to use:
+
+EnumValue
+Enum(gpu_type) String(carrizo) Value(PROCESSOR_CARRIZO)
+
+EnumValue
+Enum(gpu_type) String(fiji) Value(PROCESSOR_FIJI)
+
+EnumValue
+Enum(gpu_type) String(gfx900) Value(PROCESSOR_VEGA)
+
+march=
+Target RejectNegative Joined ToLower Enum(gpu_type) Var(gcn_arch) Init(PROCESSOR_CARRIZO)
+Specify the name of the target GPU.
+
+mtune=
+Target RejectNegative Joined ToLower Enum(gpu_type) Var(gcn_tune) Init(PROCESSOR_CARRIZO)
+Specify the name of the target GPU.
+
+m32
+Target Report RejectNegative InverseMask(ABI64)
+Generate code for a 32-bit ABI.
+
+m64
+Target Report RejectNegative Mask(ABI64)
+Generate code for a 64-bit ABI.
+
+mgomp
+Target Report RejectNegative
+Enable OpenMP GPU offloading.
+
+bool flag_bypass_init_error = false
+
+mbypass-init-error
+Target Report RejectNegative Var(flag_bypass_init_error)
+
+bool flag_worker_partitioning = false
+
+macc-experimental-workers
+Target Report Var(flag_worker_partitioning) Init(1)
+
+int stack_size_opt = -1
+
+mstack-size=
+Target Report RejectNegative Joined UInteger Var(stack_size_opt) Init(-1)
+-mstack-size=<number>	Set the private segment size per wave-front, in bytes.
+
+mlocal-symbol-id=
+Target RejectNegative Report JoinedOrMissing Var(local_symbol_id) Init(0)
+
+Wopenacc-dims
+Target Var(warn_openacc_dims) Warning
+Warn about invalid OpenACC dimensions.
diff --git a/gcc/config/gcn/mkoffload.c b/gcc/config/gcn/mkoffload.c
new file mode 100644
index 0000000..57e0f25
--- /dev/null
+++ b/gcc/config/gcn/mkoffload.c
@@ -0,0 +1,697 @@
+/* Offload image generation tool for AMD GCN.
+
+   Copyright (C) 2014-2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Munges GCN assembly into a C source file defining the GCN code as a
+   string.
+
+   This is not a complete assembler.  We presume the source is well
+   formed from the compiler and can die horribly if it is not.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "obstack.h"
+#include "diagnostic.h"
+#include "intl.h"
+#include <libgen.h>
+#include "collect-utils.h"
+#include "gomp-constants.h"
+
+const char tool_name[] = "gcn mkoffload";
+
+#define COMMENT_PREFIX "#"
+
+struct id_map
+{
+  id_map *next;
+  char *gcn_name;
+};
+
+static id_map *func_ids, **funcs_tail = &func_ids;
+static id_map *var_ids, **vars_tail = &var_ids;
+
+/* Files to unlink.  */
+static const char *gcn_s1_name;
+static const char *gcn_s2_name;
+static const char *gcn_o_name;
+static const char *gcn_cfile_name;
+
+enum offload_abi offload_abi = OFFLOAD_ABI_UNSET;
+
+/* Delete tempfiles.  */
+
+void
+tool_cleanup (bool from_signal ATTRIBUTE_UNUSED)
+{
+  if (gcn_cfile_name)
+    maybe_unlink (gcn_cfile_name);
+  if (gcn_s1_name)
+    maybe_unlink (gcn_s1_name);
+  if (gcn_s2_name)
+    maybe_unlink (gcn_s2_name);
+  if (gcn_o_name)
+    maybe_unlink (gcn_o_name);
+}
+
+static void
+mkoffload_cleanup (void)
+{
+  tool_cleanup (false);
+}
+
+/* Unlink FILE unless requested otherwise.  */
+
+void
+maybe_unlink (const char *file)
+{
+  if (!save_temps)
+    {
+      if (unlink_if_ordinary (file) && errno != ENOENT)
+	fatal_error (input_location, "deleting file %s: %m", file);
+    }
+  else if (verbose)
+    fprintf (stderr, "[Leaving %s]\n", file);
+}
+
+/* Add or change the value of an environment variable, outputting the
+   change to standard error if in verbose mode.  */
+
+static void
+xputenv (const char *string)
+{
+  if (verbose)
+    fprintf (stderr, "%s\n", string);
+  putenv (CONST_CAST (char *, string));
+}
+
+/* Read the whole input file.  It will be NUL terminated (but
+   remember, there could be a NUL in the file itself).  */
+
+static const char *
+read_file (FILE *stream, size_t *plen)
+{
+  size_t alloc = 16384;
+  size_t base = 0;
+  char *buffer;
+
+  if (!fseek (stream, 0, SEEK_END))
+    {
+      /* Get the file size.  */
+      long s = ftell (stream);
+      if (s >= 0)
+	alloc = s + 100;
+      fseek (stream, 0, SEEK_SET);
+    }
+  buffer = XNEWVEC (char, alloc);
+
+  for (;;)
+    {
+      size_t n = fread (buffer + base, 1, alloc - base - 1, stream);
+
+      if (!n)
+	break;
+      base += n;
+      if (base + 1 == alloc)
+	{
+	  alloc *= 2;
+	  buffer = XRESIZEVEC (char, buffer, alloc);
+	}
+    }
+  buffer[base] = 0;
+  *plen = base;
+  return buffer;
+}
+
+/* Parse STR, saving found tokens into PVALUES and return their number.
+   Tokens are assumed to be delimited by ':'.  */
+
+static unsigned
+parse_env_var (const char *str, char ***pvalues)
+{
+  const char *curval, *nextval;
+  char **values;
+  unsigned num = 1, i;
+
+  curval = strchr (str, ':');
+  while (curval)
+    {
+      num++;
+      curval = strchr (curval + 1, ':');
+    }
+
+  values = (char **) xmalloc (num * sizeof (char *));
+  curval = str;
+  nextval = strchr (curval, ':');
+  if (nextval == NULL)
+    nextval = strchr (curval, '\0');
+
+  for (i = 0; i < num; i++)
+    {
+      int l = nextval - curval;
+      values[i] = (char *) xmalloc (l + 1);
+      memcpy (values[i], curval, l);
+      values[i][l] = 0;
+      curval = nextval + 1;
+      nextval = strchr (curval, ':');
+      if (nextval == NULL)
+	nextval = strchr (curval, '\0');
+    }
+  *pvalues = values;
+  return num;
+}
+
+/* Auxiliary function that frees elements of PTR and PTR itself.
+   N is the number of elements to be freed.  If PTR is NULL, nothing is freed.
+   If an element is NULL, subsequent elements are not freed.  */
+
+static void
+free_array_of_ptrs (void **ptr, unsigned n)
+{
+  unsigned i;
+  if (!ptr)
+    return;
+  for (i = 0; i < n; i++)
+    {
+      if (!ptr[i])
+	break;
+      free (ptr[i]);
+    }
+  free (ptr);
+  return;
+}
+
+/* Check whether NAME can be accessed in MODE.  This is like access,
+   except that it never considers directories to be executable.  */
+
+static int
+access_check (const char *name, int mode)
+{
+  if (mode == X_OK)
+    {
+      struct stat st;
+
+      if (stat (name, &st) < 0 || S_ISDIR (st.st_mode))
+	return -1;
+    }
+
+  return access (name, mode);
+}
+
+/* Parse an input assembler file, extract the offload tables etc.,
+   and output (1) the assembler code, minus the tables (which can contain
+   problematic relocations), and (2) a C file with the offload tables
+   encoded as structured data.  */
+
+static void
+process_asm (FILE *in, FILE *out, FILE *cfile)
+{
+  int fn_count = 0, var_count = 0, dims_count = 0;
+  struct obstack fns_os, vars_os, varsizes_os, dims_os;
+  obstack_init (&fns_os);
+  obstack_init (&vars_os);
+  obstack_init (&varsizes_os);
+  obstack_init (&dims_os);
+
+  struct oaccdims
+  {
+    int d[3];
+    char *name;
+  } dim;
+
+  char buf[1000];
+  enum { IN_CODE, IN_VARS, IN_FUNCS } state = IN_CODE;
+  while (fgets (buf, sizeof (buf), in))
+    {
+      switch (state)
+	{
+	case IN_CODE:
+	  {
+	    if (sscanf (buf, " ;; OPENACC-DIMS: %d, %d, %d : %ms\n",
+			&dim.d[0], &dim.d[1], &dim.d[2], &dim.name) == 4)
+	      {
+		obstack_grow (&dims_os, &dim, sizeof (dim));
+		dims_count++;
+	      }
+	    break;
+	  }
+	case IN_VARS:
+	  {
+	    char *varname;
+	    unsigned varsize;
+	    if (sscanf (buf, " .8byte %ms\n", &varname))
+	      {
+		obstack_ptr_grow (&vars_os, varname);
+		fgets (buf, sizeof (buf), in);
+		if (!sscanf (buf, " .8byte %u\n", &varsize))
+		  abort ();
+		obstack_int_grow (&varsizes_os, varsize);
+		var_count++;
+	      }
+	    break;
+	  }
+	case IN_FUNCS:
+	  {
+	    char *funcname;
+	    if (sscanf (buf, "\t.8byte\t%ms\n", &funcname))
+	      {
+		obstack_ptr_grow (&fns_os, funcname);
+		fn_count++;
+		continue;
+	      }
+	    break;
+	  }
+	}
+
+      char dummy;
+      if (sscanf (buf, " .section .gnu.offload_vars%c", &dummy) > 0)
+	state = IN_VARS;
+      else if (sscanf (buf, " .section .gnu.offload_funcs%c", &dummy) > 0)
+	state = IN_FUNCS;
+      else if (sscanf (buf, " .section %c", &dummy) > 0
+	       || sscanf (buf, " .text%c", &dummy) > 0
+	       || sscanf (buf, " .bss%c", &dummy) > 0
+	       || sscanf (buf, " .data%c", &dummy) > 0
+	       || sscanf (buf, " .ident %c", &dummy) > 0)
+	state = IN_CODE;
+
+      if (state == IN_CODE)
+	fputs (buf, out);
+    }
+
+  char **fns = XOBFINISH (&fns_os, char **);
+  struct oaccdims *dims = XOBFINISH (&dims_os, struct oaccdims *);
+
+  fprintf (cfile, "#include <stdlib.h>\n");
+  fprintf (cfile, "#include <stdbool.h>\n\n");
+
+  char **vars = XOBFINISH (&vars_os, char **);
+  unsigned *varsizes = XOBFINISH (&varsizes_os, unsigned *);
+  fprintf (cfile,
+	   "static const struct global_var_info {\n"
+	   "  const char *name;\n"
+	   "  void *address;\n"
+	   "} vars[] = {\n");
+  int i;
+  for (i = 0; i < var_count; ++i)
+    {
+      const char *sep = i < var_count - 1 ? "," : " ";
+      fprintf (cfile, "  { \"%s\", NULL }%s /* size: %u */\n", vars[i], sep,
+	       varsizes[i]);
+    }
+  fprintf (cfile, "};\n\n");
+
+  obstack_free (&vars_os, NULL);
+  obstack_free (&varsizes_os, NULL);
+
+  /* Dump out function idents.  */
+  fprintf (cfile, "static const struct hsa_kernel_description {\n"
+	   "  const char *name;\n"
+	   "  unsigned omp_data_size;\n"
+	   "  bool gridified_kernel_p;\n"
+	   "  unsigned kernel_dependencies_count;\n"
+	   "  const char **kernel_dependencies;\n"
+	   "  int oacc_dims[3];\n"
+	   "} gcn_kernels[] = {\n  ");
+  dim.d[0] = dim.d[1] = dim.d[2] = 0;
+  const char *comma;
+  for (comma = "", i = 0; i < fn_count; comma = ",\n  ", i++)
+    {
+      /* Find out whether we recorded dimensions for this function.  */
+      int *d = dim.d;		/* Previously zeroed.  */
+      for (int j = 0; j < dims_count; j++)
+	if (strcmp (fns[i], dims[j].name) == 0)
+	  {
+	    d = dims[j].d;
+	    break;
+	  }
+
+      fprintf (cfile, "%s{\"%s\", 0, 0, 0, NULL, {%d, %d, %d}}", comma,
+	       fns[i], d[0], d[1], d[2]);
+
+      free (fns[i]);
+    }
+  fprintf (cfile, "\n};\n\n");
+
+  obstack_free (&fns_os, NULL);
+  for (i = 0; i < dims_count; i++)
+    free (dims[i].name);
+  obstack_free (&dims_os, NULL);
+}
+
+/* Embed an object file into a C source file.  */
+
+static void
+process_obj (FILE *in, FILE *cfile)
+{
+  size_t len = 0;
+  const char *input = read_file (in, &len);
+  id_map const *id;
+  unsigned ix;
+
+  /* Dump out an array containing the binary.
+     FIXME: do this with objcopy.  */
+  fprintf (cfile, "static unsigned char gcn_code[] = {");
+  for (size_t i = 0; i < len; i += 17)
+    {
+      fprintf (cfile, "\n\t");
+      for (size_t j = i; j < i + 17 && j < len; j++)
+	fprintf (cfile, "%3u,", (unsigned char) input[j]);
+    }
+  fprintf (cfile, "\n};\n\n");
+
+  fprintf (cfile,
+	   "static const struct gcn_image {\n"
+	   "  char magic[4];\n"
+	   "  size_t size;\n"
+	   "  void *image;\n"
+	   "} gcn_image = {\n"
+	   "  \"GCN\",\n"
+	   "  %zu,\n"
+	   "  gcn_code\n"
+	   "};\n\n",
+	   len);
+
+  fprintf (cfile,
+	   "static const struct brig_image_desc {\n"
+	   "  const struct gcn_image *gcn_image;\n"
+	   "  unsigned kernel_count;\n"
+	   "  const struct hsa_kernel_description *kernel_infos;\n"
+	   "  unsigned global_variable_count;\n"
+	   "  const struct global_var_info *global_variables;\n"
+	   "} target_data = {\n"
+	   "  &gcn_image,\n"
+	   "  sizeof (gcn_kernels) / sizeof (gcn_kernels[0]),\n"
+	   "  gcn_kernels,\n"
+	   "  sizeof (vars) / sizeof (vars[0]),\n"
+	   "  vars\n"
+	   "};\n\n");
+
+  fprintf (cfile,
+	   "#ifdef __cplusplus\n"
+	   "extern \"C\" {\n"
+	   "#endif\n"
+	   "extern void GOMP_offload_register_ver"
+	   " (unsigned, const void *, int, const void *);\n"
+	   "extern void GOMP_offload_unregister_ver"
+	   " (unsigned, const void *, int, const void *);\n"
+	   "#ifdef __cplusplus\n"
+	   "}\n"
+	   "#endif\n\n");
+
+  fprintf (cfile, "extern const void *const __OFFLOAD_TABLE__[];\n\n");
+
+  fprintf (cfile, "static __attribute__((constructor)) void init (void)\n"
+	   "{\n"
+	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
+	   " %d/*GCN*/, &target_data);\n"
+	   "};\n",
+	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_GCN),
+	   GOMP_DEVICE_GCN);
+
+  fprintf (cfile, "static __attribute__((destructor)) void fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
+	   " %d/*GCN*/, &target_data);\n"
+	   "};\n",
+	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_GCN),
+	   GOMP_DEVICE_GCN);
+}
+
+/* Compile a C file using the host compiler.  */
+
+static void
+compile_native (const char *infile, const char *outfile, const char *compiler)
+{
+  const char *collect_gcc_options = getenv ("COLLECT_GCC_OPTIONS");
+  if (!collect_gcc_options)
+    fatal_error (input_location,
+		 "environment variable COLLECT_GCC_OPTIONS must be set");
+
+  struct obstack argv_obstack;
+  obstack_init (&argv_obstack);
+  obstack_ptr_grow (&argv_obstack, compiler);
+  if (save_temps)
+    obstack_ptr_grow (&argv_obstack, "-save-temps");
+  if (verbose)
+    obstack_ptr_grow (&argv_obstack, "-v");
+  switch (offload_abi)
+    {
+    case OFFLOAD_ABI_LP64:
+      obstack_ptr_grow (&argv_obstack, "-m64");
+      break;
+    case OFFLOAD_ABI_ILP32:
+      obstack_ptr_grow (&argv_obstack, "-m32");
+      break;
+    default:
+      gcc_unreachable ();
+    }
+  obstack_ptr_grow (&argv_obstack, infile);
+  obstack_ptr_grow (&argv_obstack, "-c");
+  obstack_ptr_grow (&argv_obstack, "-o");
+  obstack_ptr_grow (&argv_obstack, outfile);
+  obstack_ptr_grow (&argv_obstack, NULL);
+
+  const char **new_argv = XOBFINISH (&argv_obstack, const char **);
+  fork_execute (new_argv[0], CONST_CAST (char **, new_argv), true);
+  obstack_free (&argv_obstack, NULL);
+}
+
+int
+main (int argc, char **argv)
+{
+  FILE *in = stdin;
+  FILE *out = stdout;
+  FILE *cfile = stdout;
+  const char *outname = 0, *offloadsrc = 0;
+
+  progname = "mkoffload";
+  diagnostic_initialize (global_dc, 0);
+
+  if (atexit (mkoffload_cleanup) != 0)
+    fatal_error (input_location, "atexit failed");
+
+  char *collect_gcc = getenv ("COLLECT_GCC");
+  if (collect_gcc == NULL)
+    fatal_error (input_location, "COLLECT_GCC must be set.");
+  const char *gcc_path = dirname (ASTRDUP (collect_gcc));
+  const char *gcc_exec = basename (ASTRDUP (collect_gcc));
+
+  size_t len = (strlen (gcc_path) + 1 + strlen (GCC_INSTALL_NAME) + 1);
+  char *driver = XALLOCAVEC (char, len);
+
+  if (strcmp (gcc_exec, collect_gcc) == 0)
+    /* collect_gcc has no path, so it was found in PATH.  Make sure we also
+       find accel-gcc in PATH.  */
+    gcc_path = NULL;
+
+  int driver_used = 0;
+  if (gcc_path != NULL)
+    driver_used = sprintf (driver, "%s/", gcc_path);
+  sprintf (driver + driver_used, "%s", GCC_INSTALL_NAME);
+
+  bool found = false;
+  if (gcc_path == NULL)
+    found = true;
+  else if (access_check (driver, X_OK) == 0)
+    found = true;
+  else
+    {
+      /* Don't use alloca pointer with XRESIZEVEC.  */
+      driver = NULL;
+      /* Look in all COMPILER_PATHs for GCC_INSTALL_NAME.  */
+      char **paths = NULL;
+      unsigned n_paths;
+      n_paths = parse_env_var (getenv ("COMPILER_PATH"), &paths);
+      for (unsigned i = 0; i < n_paths; i++)
+	{
+	  len = strlen (paths[i]) + 1 + strlen (GCC_INSTALL_NAME) + 1;
+	  driver = XRESIZEVEC (char, driver, len);
+	  sprintf (driver, "%s/%s", paths[i], GCC_INSTALL_NAME);
+	  if (access_check (driver, X_OK) == 0)
+	    {
+	      found = true;
+	      break;
+	    }
+	}
+      free_array_of_ptrs ((void **) paths, n_paths);
+    }
+
+  if (!found)
+    fatal_error (input_location,
+		 "offload compiler %s not found", GCC_INSTALL_NAME);
+
+  /* We may be called with all the arguments stored in some file and
+     passed with @file.  Expand them into argv before processing.  */
+  expandargv (&argc, &argv);
+
+  /* Scan the argument vector.  */
+  bool fopenmp = false;
+  bool fopenacc = false;
+  for (int i = 1; i < argc; i++)
+    {
+#define STR "-foffload-abi="
+      if (strncmp (argv[i], STR, strlen (STR)) == 0)
+	{
+	  if (strcmp (argv[i] + strlen (STR), "lp64") == 0)
+	    offload_abi = OFFLOAD_ABI_LP64;
+	  else if (strcmp (argv[i] + strlen (STR), "ilp32") == 0)
+	    offload_abi = OFFLOAD_ABI_ILP32;
+	  else
+	    fatal_error (input_location,
+			 "unrecognizable argument of option " STR);
+	}
+#undef STR
+      else if (strcmp (argv[i], "-fopenmp") == 0)
+	fopenmp = true;
+      else if (strcmp (argv[i], "-fopenacc") == 0)
+	fopenacc = true;
+      else if (strcmp (argv[i], "-save-temps") == 0)
+	save_temps = true;
+      else if (strcmp (argv[i], "-v") == 0)
+	verbose = true;
+    }
+  if (!(fopenacc ^ fopenmp))
+    fatal_error (input_location, "either -fopenacc or -fopenmp must be set");
+
+  const char *abi;
+  switch (offload_abi)
+    {
+    case OFFLOAD_ABI_LP64:
+      abi = "-m64";
+      break;
+    case OFFLOAD_ABI_ILP32:
+      abi = "-m32";
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  gcn_s1_name = make_temp_file (".mkoffload.1.s");
+  gcn_s2_name = make_temp_file (".mkoffload.2.s");
+  gcn_o_name = make_temp_file (".mkoffload.hsaco");
+  gcn_cfile_name = make_temp_file (".c");
+
+  /* Build arguments for compiler pass.  */
+  struct obstack cc_argv_obstack;
+  obstack_init (&cc_argv_obstack);
+  obstack_ptr_grow (&cc_argv_obstack, driver);
+  obstack_ptr_grow (&cc_argv_obstack, "-S");
+
+  if (save_temps)
+    obstack_ptr_grow (&cc_argv_obstack, "-save-temps");
+  if (verbose)
+    obstack_ptr_grow (&cc_argv_obstack, "-v");
+  obstack_ptr_grow (&cc_argv_obstack, abi);
+  obstack_ptr_grow (&cc_argv_obstack, "-xlto");
+  if (fopenmp)
+    obstack_ptr_grow (&cc_argv_obstack, "-mgomp");
+
+  for (int ix = 1; ix != argc; ix++)
+    {
+      if (!strcmp (argv[ix], "-o") && ix + 1 != argc)
+	outname = argv[++ix];
+      else
+	{
+	  obstack_ptr_grow (&cc_argv_obstack, argv[ix]);
+
+	  if (argv[ix][0] != '-')
+	    offloadsrc = argv[ix];
+	}
+    }
+
+  obstack_ptr_grow (&cc_argv_obstack, "-o");
+  obstack_ptr_grow (&cc_argv_obstack, gcn_s1_name);
+  obstack_ptr_grow (&cc_argv_obstack,
+		    concat ("-mlocal-symbol-id=", offloadsrc, NULL));
+  obstack_ptr_grow (&cc_argv_obstack, NULL);
+  const char **cc_argv = XOBFINISH (&cc_argv_obstack, const char **);
+
+  /* FIXME: remove this hack.
+     Allow an environment override hook for debug purposes.  */
+  const char *override_gcn_s2_name = getenv ("OVERRIDE_GCN_INPUT_ASM");
+
+  /* Build arguments for assemble/link pass.  */
+  struct obstack ld_argv_obstack;
+  obstack_init (&ld_argv_obstack);
+  obstack_ptr_grow (&ld_argv_obstack, driver);
+  obstack_ptr_grow (&ld_argv_obstack, (override_gcn_s2_name ? : gcn_s2_name));
+  obstack_ptr_grow (&ld_argv_obstack, "-lgomp");
+
+  for (int i = 1; i < argc; i++)
+    if (strncmp (argv[i], "-l", 2) == 0
+	|| strncmp (argv[i], "-Wl", 3) == 0
+	|| strncmp (argv[i], "-march", 6) == 0)
+      obstack_ptr_grow (&ld_argv_obstack, argv[i]);
+
+  obstack_ptr_grow (&ld_argv_obstack, "-o");
+  obstack_ptr_grow (&ld_argv_obstack, gcn_o_name);
+  obstack_ptr_grow (&ld_argv_obstack, NULL);
+  const char **ld_argv = XOBFINISH (&ld_argv_obstack, const char **);
+
+  /* Clean up unhelpful environment variables.  */
+  char *execpath = getenv ("GCC_EXEC_PREFIX");
+  char *cpath = getenv ("COMPILER_PATH");
+  char *lpath = getenv ("LIBRARY_PATH");
+  unsetenv ("GCC_EXEC_PREFIX");
+  unsetenv ("COMPILER_PATH");
+  unsetenv ("LIBRARY_PATH");
+
+  /* Run the compiler pass.  */
+  fork_execute (cc_argv[0], CONST_CAST (char **, cc_argv), true);
+  obstack_free (&cc_argv_obstack, NULL);
+
+  in = fopen (gcn_s1_name, "r");
+  if (!in)
+    fatal_error (input_location, "cannot open intermediate gcn asm file");
+
+  out = fopen (gcn_s2_name, "w");
+  if (!out)
+    fatal_error (input_location, "cannot open '%s'", gcn_s2_name);
+
+  cfile = fopen (gcn_cfile_name, "w");
+  if (!cfile)
+    fatal_error (input_location, "cannot open '%s'", gcn_cfile_name);
+
+  process_asm (in, out, cfile);
+
+  fclose (in);
+  fclose (out);
+
+  /* Run the assemble/link pass.  */
+  fork_execute (ld_argv[0], CONST_CAST (char **, ld_argv), true);
+  obstack_free (&ld_argv_obstack, NULL);
+
+  in = fopen (gcn_o_name, "r");
+  if (!in)
+    fatal_error (input_location, "cannot open intermediate gcn obj file");
+
+  process_obj (in, cfile);
+
+  fclose (in);
+  fclose (cfile);
+
+  xputenv (concat ("GCC_EXEC_PREFIX=", execpath, NULL));
+  xputenv (concat ("COMPILER_PATH=", cpath, NULL));
+  xputenv (concat ("LIBRARY_PATH=", lpath, NULL));
+
+  compile_native (gcn_cfile_name, outname, collect_gcc);
+
+  return 0;
+}
diff --git a/gcc/config/gcn/offload.h b/gcc/config/gcn/offload.h
new file mode 100644
index 0000000..94c44e2
--- /dev/null
+++ b/gcc/config/gcn/offload.h
@@ -0,0 +1,35 @@
+/* Support for AMD GCN offloading.
+
+   Copyright (C) 2014-2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_GCN_OFFLOAD_H
+#define GCC_GCN_OFFLOAD_H
+
+/* Support for OpenACC acc_on_device.  */
+
+#include "gomp-constants.h"
+
+#define ACCEL_COMPILER_acc_device GOMP_DEVICE_GCN
+
+#endif
diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md
new file mode 100644
index 0000000..6d10b1d
--- /dev/null
+++ b/gcc/config/gcn/predicates.md
@@ -0,0 +1,189 @@
+;; Predicate definitions for GCN.
+;; Copyright (C) 2016-2017 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+
+(define_predicate "gcn_conditional_register_operand"
+  (match_operand 0 "register_operand")
+{
+  if (GET_CODE (op) == SUBREG)
+    op = SUBREG_REG (op);
+
+  if (!REG_P (op))
+    return 0;
+
+  return REGNO (op) == VCCZ_REG
+	 || REGNO (op) == SCC_REG
+	 || REGNO (op) == EXECZ_REG
+	 || REGNO (op) >= FIRST_PSEUDO_REGISTER;
+})
+
+(define_predicate "gcn_ssrc_register_operand"
+  (match_operand 0 "register_operand")
+{
+  if (GET_CODE (op) == SUBREG)
+    op = SUBREG_REG (op);
+
+  if (!REG_P (op))
+    return false;
+
+  return SSRC_REGNO_P (REGNO (op)) || REGNO (op) >= FIRST_PSEUDO_REGISTER;
+})
+
+(define_predicate "gcn_sdst_register_operand"
+  (match_operand 0 "register_operand")
+{
+  if (GET_CODE (op) == SUBREG)
+    op = SUBREG_REG (op);
+
+  if (!REG_P (op))
+    return false;
+
+  return SDST_REGNO_P (REGNO (op)) || REGNO (op) >= FIRST_PSEUDO_REGISTER;
+})
+
+(define_predicate "gcn_vgpr_register_operand"
+  (match_operand 0 "register_operand")
+{
+  if (GET_CODE (op) == SUBREG)
+    op = SUBREG_REG (op);
+
+  if (!REG_P (op))
+    return false;
+
+  return VGPR_REGNO_P (REGNO (op)) || REGNO (op) >= FIRST_PSEUDO_REGISTER;
+})
+
+(define_predicate "gcn_inline_immediate_operand"
+  (match_code "const_int,const_double,const_vector")
+{
+  return gcn_inline_constant_p (op);
+})
+
+(define_predicate "gcn_vop3_operand"
+  (ior (match_operand 0 "gcn_inline_immediate_operand")
+       (match_operand 0 "register_operand")))
+
+(define_predicate "gcn_vec0_operand"
+  (match_code "const_vector")
+{
+  return CONST_VECTOR_ELT (op, 0) == const0_rtx && gcn_inline_constant_p (op);
+})
+
+(define_predicate "gcn_vec1_operand"
+  (match_code "const_vector")
+{
+  return CONST_VECTOR_ELT (op, 0) == const1_rtx && gcn_inline_constant_p (op);
+})
+
+(define_predicate "gcn_vec1d_operand"
+  (match_code "const_vector")
+{
+  if (!gcn_inline_constant_p (op))
+    return false;
+
+  rtx elem = CONST_VECTOR_ELT (op, 0);
+  if (!CONST_DOUBLE_P (elem))
+    return false;
+  return real_identical (CONST_DOUBLE_REAL_VALUE (elem), &dconst1);
+})
+
+(define_predicate "gcn_const1d_operand"
+  (match_code "const_double")
+{
+  return gcn_inline_constant_p (op)
+      && real_identical (CONST_DOUBLE_REAL_VALUE (op), &dconst1);
+})
+
+(define_predicate "gcn_32bit_immediate_operand"
+  (match_code "const_int,const_double,const_vector,symbol_ref,label_ref")
+{
+  return gcn_constant_p (op);
+})
+
+; LRA works more smoothly when exec values are immediate constants
+; prior to register allocation.
+(define_predicate "gcn_exec_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_code "const_int")))
+
+(define_predicate "gcn_exec_reg_operand"
+  (match_operand 0 "register_operand"))
+
+(define_predicate "gcn_load_operand"
+  (ior (match_operand 0 "nonimmediate_operand")
+       (match_operand 0 "gcn_32bit_immediate_operand")))
+
+(define_predicate "gcn_alu_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_operand 0 "gcn_32bit_immediate_operand")))
+
+(define_predicate "gcn_ds_memory_operand"
+  (and (match_code "mem")
+       (and (match_test "AS_LDS_P (MEM_ADDR_SPACE (op)) || AS_GDS_P (MEM_ADDR_SPACE (op))")
+	    (match_operand 0 "memory_operand"))))
+
+(define_predicate "gcn_valu_dst_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_operand 0 "gcn_ds_memory_operand")))
+
+(define_predicate "gcn_valu_src0_operand"
+  (ior (match_operand 0 "register_operand")
+       (ior (match_operand 0 "gcn_32bit_immediate_operand")
+	    (match_operand 0 "gcn_ds_memory_operand"))))
+
+(define_predicate "gcn_valu_src1_operand"
+  (match_operand 0 "register_operand"))
+
+(define_predicate "gcn_valu_src1com_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_operand 0 "gcn_32bit_immediate_operand")))
+
+(define_predicate "gcn_conditional_operator"
+  (match_code "eq,ne"))
+
+(define_predicate "gcn_compare_64bit_operator"
+  (match_code "eq,ne"))
+
+(define_predicate "gcn_compare_operator"
+  (match_code "eq,ne,gt,ge,lt,le,gtu,geu,ltu,leu"))
+
+(define_predicate "gcn_fp_compare_operator"
+  (match_code "eq,ne,gt,ge,lt,le,gtu,geu,ltu,leu,ordered,unordered"))
+
+(define_predicate "unary_operator"
+  (match_code "not,popcount"))
+
+(define_predicate "binary_operator"
+  (match_code "and,ior,xor,ashift,lshiftrt,ashiftrt,smin,smax,umin,umax"))
+
+(define_predicate "gcn_register_or_unspec_operand"
+  (ior (match_operand 0 "register_operand")
+       (and (match_code "unspec")
+            (match_test "XINT (op, 1) == UNSPEC_VECTOR"))))
+
+(define_predicate "gcn_alu_or_unspec_operand"
+  (ior (match_operand 0 "gcn_alu_operand")
+       (and (match_code "unspec")
+            (match_test "XINT (op, 1) == UNSPEC_VECTOR"))))
+
+(define_predicate "gcn_register_ds_or_unspec_operand"
+  (ior (match_operand 0 "register_operand")
+       (ior (match_operand 0 "gcn_ds_memory_operand")
+	    (and (match_code "unspec")
+              (match_test "XINT (op, 1) == UNSPEC_VECTOR")))))
diff --git a/gcc/config/gcn/t-gcn-hsa b/gcc/config/gcn/t-gcn-hsa
new file mode 100644
index 0000000..da5cd6a
--- /dev/null
+++ b/gcc/config/gcn/t-gcn-hsa
@@ -0,0 +1,51 @@
+#  Copyright (C) 2016-2018 Free Software Foundation, Inc.
+#
+#  This file is free software; you can redistribute it and/or modify it under
+#  the terms of the GNU General Public License as published by the Free
+#  Software Foundation; either version 3 of the License, or (at your option)
+#  any later version.
+#
+#  This file is distributed in the hope that it will be useful, but WITHOUT
+#  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+#  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+#  for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with GCC; see the file COPYING3.  If not see
+#  <http://www.gnu.org/licenses/>.
+
+GTM_H += $(HASH_TABLE_H)
+
+driver-gcn.o: $(srcdir)/config/gcn/driver-gcn.c
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
+CFLAGS-mkoffload.o += $(DRIVER_DEFINES) \
+	-DGCC_INSTALL_NAME=\"$(GCC_INSTALL_NAME)\"
+mkoffload.o: $(srcdir)/config/gcn/mkoffload.c
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+ALL_HOST_OBJS += mkoffload.o
+
+mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a \
+		      $(LIBIBERTY) $(LIBDEPS)
+	+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
+	  mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS)
+
+CFLAGS-gcn-run.o += -DVERSION_STRING=$(PKGVERSION_s)
+gcn-run.o: $(srcdir)/config/gcn/gcn-run.c
+	$(COMPILE) -x c -std=gnu11 $<
+	$(POSTCOMPILE)
+ALL_HOST_OBJS += gcn-run.o
+
+gcn-run$(exeext): gcn-run.o
+	+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl
+
+MULTILIB_OPTIONS = march=gfx900
+MULTILIB_DIRNAMES = gcn5
+
+PASSES_EXTRA += $(srcdir)/config/gcn/gcn-passes.def
+gcn-tree.o: $(srcdir)/config/gcn/gcn-tree.c
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+ALL_HOST_OBJS += gcn-tree.o


* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 13:43 ` [PATCH 21/25] GCN Back-end (part 2/2) Andrew Stubbs
@ 2018-09-05 14:22   ` Joseph Myers
  2018-09-05 14:35     ` Andrew Stubbs
  2018-09-12 13:42     ` Andrew Stubbs
  2018-11-09 19:40   ` Jeff Law
  1 sibling, 2 replies; 187+ messages in thread
From: Joseph Myers @ 2018-09-05 14:22 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Wed, 5 Sep 2018, Andrew Stubbs wrote:

> +       warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
> +                   OPT_Wopenacc_dims,
> +                   (dims[GOMP_DIM_VECTOR]
> +                    ? "using vector_length (64), ignoring %d"
> +                    : "using vector_length (64), ignoring runtime setting"),

In cases like this with alternative diagnostic messages using ?:, you need 
to mark up each message with G_() so they both get extracted for 
translation by exgettext.
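
That is, something like this (remaining arguments elided as in the
quoted hunk):

  warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
              OPT_Wopenacc_dims,
              (dims[GOMP_DIM_VECTOR]
               ? G_("using vector_length (64), ignoring %d")
               : G_("using vector_length (64), ignoring runtime setting")),
              ...);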

> +    fatal_error (input_location, "COLLECT_GCC must be set.");

No '.' at end of diagnostic.
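
I.e.:

  fatal_error (input_location, "COLLECT_GCC must be set");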

> +#define STR "-foffload-abi="

> +           fatal_error (input_location,
> +                        "unrecognizable argument of option " STR);

This concatenation with a macro won't work with exgettext extracting 
messages for translation.
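
One way to keep exgettext happy is to spell the option name out in the
format string (a sketch, using the %<...%> quoting directives):

  fatal_error (input_location,
               "unrecognizable argument of option %<-foffload-abi%>");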

> +    fatal_error (input_location, "cannot open '%s'", gcn_s2_name);

> +    fatal_error (input_location, "cannot open '%s'", gcn_cfile_name);

Use %qs (presuming this code is using the generic diagnostic machinery 
that supports it).
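
I.e.:

  fatal_error (input_location, "cannot open %qs", gcn_s2_name);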

+gcn-run$(exeext): gcn-run.o
+       +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl

I'd expect this to fail on non-Unix configurations that don't have -ldl, 
and thus to need appropriate conditionals / configure tests to avoid that 
build failure.

A new port should add an appropriate entry to contrib/config-list.mk.  
You should also verify that the port does build using that 
contrib/config-list.mk entry, with the same version of GCC, built 
natively, in the PATH, or equivalently that the port builds with the same 
version of GCC, built natively, in the PATH, when you configure with 
--enable-werror-always and the other options config-list.mk uses - this is 
the cross-compiler equivalent of the native use of -Werror in the later 
stages of bootstrap.  (Preferably verify this building for both 32-bit and 
64-bit hosts, since it's easy to have warnings that only show up for one 
but not the other.)

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 14:22   ` Joseph Myers
@ 2018-09-05 14:35     ` Andrew Stubbs
  2018-09-05 14:44       ` Joseph Myers
  2018-09-12 13:42     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-05 14:35 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

On 05/09/18 15:22, Joseph Myers wrote:
> +gcn-run$(exeext): gcn-run.o
> +       +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl
> 
> I'd expect this to fail on non-Unix configurations that don't have -ldl,
> and thus to need appropriate conditionals / configure tests to avoid that
> build failure.

We don't support any host system other than x86_64 Linux. There are no 
drivers for any other system, and the offloaded datatypes need to be 
binary compatible, so even 32-bit x86 doesn't work.

I suppose someone might choose to compile things on an alternative 
system for running on a compatible system, in which case we'd want to 
simply skip this binary.

How does one normally do this?

> A new port should add an appropriate entry to contrib/config-list.mk.
> You should also verify that the port does build using that
> contrib/config-list.mk entry, with the same version of GCC, built
> natively, in the PATH, or equivalently that the port builds with the same
> version of GCC, built natively, in the PATH, when you configure with
> --enable-werror-always and the other options config-list.mk uses - this is
> the cross-compiler equivalent of the native use of -Werror in the later
> stages of bootstrap.  (Preferably verify this building for both 32-bit and
> 64-bit hosts, since it's easy to have warnings that only show up for one
> but not the other.)

I didn't know about that one.

I see it uses "--enable-languages=all", but GCN is known to fail to 
build libstdc++ (exceptions and static constructors are not 
implemented), so I wouldn't expect the build to succeed.

Andrew


* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 14:35     ` Andrew Stubbs
@ 2018-09-05 14:44       ` Joseph Myers
  2018-09-11 16:25         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Joseph Myers @ 2018-09-05 14:44 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Wed, 5 Sep 2018, Andrew Stubbs wrote:

> I suppose someone might choose to compile things on an alternative system for
> running on a compatible system, in which case we'd want to simply skip this
> binary.
> 
> How does one normally do this?

I'd expect a configure test plus makefile conditionals in the makefile 
fragment.
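
For example (an untested sketch), a configure.ac check along the lines
of

  AC_SEARCH_LIBS([dlopen], [dl], [DL_LIB=-ldl], [DL_LIB=])
  AC_SUBST([DL_LIB])

and then a conditional in t-gcn-hsa that only builds gcn-run$(exeext),
linking with $(DL_LIB) rather than a hard-coded -ldl, when the library
was found.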

> > A new port should add an appropriate entry to contrib/config-list.mk.
> > You should also verify that the port does build using that
> > contrib/config-list.mk entry, with the same version of GCC, built
> > natively, in the PATH, or equivalently that the port builds with the same
> > version of GCC, built natively, in the PATH, when you configure with
> > --enable-werror-always and the other options config-list.mk uses - this is
> > the cross-compiler equivalent of the native use of -Werror in the later
> > stages of bootstrap.  (Preferably verify this building for both 32-bit and
> > 64-bit hosts, since it's easy to have warnings that only show up for one
> > but not the other.)
> 
> I didn't know about that one.

See sourcebuild.texi, "Back End", for lists of places to update for a new 
port, which includes config-list.mk in the list of places to update for a 
port being contributed upstream.

> I see it uses "--enable-languages=all", but GCN is known to fail to build
> libstdc++ (exceptions and static constructors are not implemented), so I
> wouldn't expect the build to succeed.

It also uses "make all-gcc", so only the host-side tools need to build 
(without warnings when building with the same version of GCC, except for 
the files that specifically use -Wno-<something>), not any libraries.
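
So the check is roughly the following (options approximate, and assuming
the amdgcn-amdhsa target name; config-list.mk has the authoritative
set):

  /path/to/gcc/configure --target=amdgcn-amdhsa --enable-werror-always \
    --enable-languages=all
  make all-gcc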

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 08/25] Fix co-array allocation
       [not found]   ` <7f5064c3-afc6-b7b5-cade-f03af5b86331@moene.org>
@ 2018-09-05 18:07     ` Janne Blomqvist
  2018-09-19 16:38       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-05 18:07 UTC (permalink / raw)
  To: Toon Moene, ams, GCC Patches, Fortran List

Please send fortran patches to the fortran list as well!

On Wed, Sep 5, 2018 at 7:54 PM Toon Moene <toon@moene.org> wrote:

>
>
>
> -------- Forwarded Message --------
> Subject: [PATCH 08/25] Fix co-array allocation
> Date: Wed, 5 Sep 2018 12:49:40 +0100
> From: ams@codesourcery.com
> To: gcc-patches@gcc.gnu.org
>
>
> The Fortran front-end has a bug in which it uses "int" values for "size_t"
> parameters.  I don't know why this isn't problem for all 64-bit
> architectures,
> but GCN ends up with the data in the wrong argument register and/or
> stack slot,
> and bad things happen.
>
> This patch corrects the issue by setting the correct type.
>
> 2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>
>
>         gcc/fortran/
>         * trans-expr.c (gfc_trans_structure_assign): Ensure that
>         integer_zero_node is of sizetype when used as the first
>         argument of a call to _gfortran_caf_register.
>

The argument must be of type size_type_node, not sizetype. Please instead
use

size = build_zero_cst (size_type_node);


>         * trans-intrinsic.c (conv_intrinsic_event_query): Convert computed
>         index to a size_t type.
>

Using integer_type_node is wrong, but the correct type for calculating
array indices (lbound, ubound,  etc.) is not size_type_node but rather
gfc_array_index_type (which in practice maps to ptrdiff_t). So please use
that, and then fold_convert index to size_type_node just before generating
the call to event_query.
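
That is, something like (sketch only):

  tree index = ...;  /* computed in gfc_array_index_type as before */
  tree arg = fold_convert (size_type_node, index);
  /* ... then use ARG as the operand of the event_query call.  */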


>         * trans-stmt.c (gfc_trans_event_post_wait): Likewise.
>

Same here as above.

Thanks,
-- 
Janne Blomqvist


* Re: [PATCH 17/25] Fix Fortran STOP.
       [not found]   ` <c0630914-1252-1391-9bf9-f03434d46f5a@moene.org>
@ 2018-09-05 18:09     ` Janne Blomqvist
  2018-09-12 13:56       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-05 18:09 UTC (permalink / raw)
  To: Toon Moene, GCC Patches, ams; +Cc: Fortran List

Same, please send fortran patches to the fortran list as well!

On Wed, Sep 5, 2018 at 7:55 PM Toon Moene <toon@moene.org> wrote:

>
>
>
> -------- Forwarded Message --------
> Subject: [PATCH 17/25] Fix Fortran STOP.
> Date: Wed, 5 Sep 2018 12:51:18 +0100
> From: ams@codesourcery.com
> To: gcc-patches@gcc.gnu.org
>
>
> The minimal libgfortran setup was created for NVPTX, but will also be
> used by
> AMD GCN.
>
> This patch simply removes an assumption that NVPTX is the only user.
> Specifically, NVPTX exit is broken, but AMD GCN exit works just fine.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>
>         libgfortran/
>         * runtime/minimal.c (exit): Only work around nvptx bugs on nvptx.
> ---
>   libgfortran/runtime/minimal.c | 2 ++
>   1 file changed, 2 insertions(+)
>
>
Ok, thanks.


-- 
Janne Blomqvist


* Re: [PATCH 18/25] Fix interleaving of Fortran stop messages
       [not found]   ` <994a9ec6-2494-9a83-cc84-bd8a551142c5@moene.org>
@ 2018-09-05 18:11     ` Janne Blomqvist
  2018-09-12 13:55       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-05 18:11 UTC (permalink / raw)
  To: Toon Moene, GCC Patches, ams; +Cc: Fortran List

On Wed, Sep 5, 2018 at 7:57 PM Toon Moene <toon@moene.org> wrote:

>
>
>
> -------- Forwarded Message --------
> Subject: [PATCH 18/25] Fix interleaving of Fortran stop messages
> Date: Wed, 5 Sep 2018 12:51:19 +0100
> From: ams@codesourcery.com
> To: gcc-patches@gcc.gnu.org
>
>
> Fortran STOP and ERROR STOP use a different function to print the "STOP"
> string
> and the message string.  On GCN this results in out-of-order output, such
> as
> "<msg>ERROR STOP ".
>
> This patch fixes the problem by making estr_write use the proper Fortran
> write,
> not C printf, so both parts are now output the same way.  This also ensures
> that both parts are output to STDERR (not that that means anything on GCN).
>
> 2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>
>
>         libgfortran/
>         * runtime/minimal.c (estr_write): Define in terms of write.
> ---
>   libgfortran/runtime/minimal.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
>
>
Ok, thanks.

-- 
Janne Blomqvist


* Re: [PATCH 19/25] GCN libgfortran.
       [not found]   ` <41281e27-ad85-e50c-8fed-6f4f6f18289c@moene.org>
@ 2018-09-05 18:14     ` Janne Blomqvist
  2018-09-06 12:37       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-05 18:14 UTC (permalink / raw)
  To: Toon Moene, GCC Patches, ams; +Cc: Fortran List

Please send fortran patches to the fortran list as well!

On Wed, Sep 5, 2018 at 7:56 PM Toon Moene <toon@moene.org> wrote:

>
>
>
> -------- Forwarded Message --------
> Subject: [PATCH 19/25] GCN libgfortran.
> Date: Wed, 5 Sep 2018 12:51:20 +0100
> From: ams@codesourcery.com
> To: gcc-patches@gcc.gnu.org
>
>
> This patch contains the GCN port of libgfortran.  We use the minimal
> configuration created for NVPTX.  That's all that's required, besides the
> target-independent bug fixes posted already.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>             Kwok Cheung Yeung  <kcy@codesourcery.com>
>             Julian Brown  <julian@codesourcery.com>
>             Tom de Vries  <tom@codesourcery.com>
>
>         libgfortran/
>         * configure.ac: Use minimal mode for amdgcn.
>         * configure: Regenerate.
> ---
>   libgfortran/configure    | 7 ++++---
>   libgfortran/configure.ac | 3 ++-
>   2 files changed, 6 insertions(+), 4 deletions(-)
>
>
>
Ok!

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 19/25] GCN libgfortran.
  2018-09-05 18:14     ` Janne Blomqvist
@ 2018-09-06 12:37       ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-06 12:37 UTC (permalink / raw)
  To: Janne Blomqvist, Toon Moene, GCC Patches; +Cc: Fortran List

On 05/09/18 19:14, Janne Blomqvist wrote:
> Please send fortran patches to the fortran list as well!

Apologies, I was not aware of this.

> Ok!

Thanks, I will commit when the rest of the port is approved.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 05/25] Add sorry_at diagnostic function.
  2018-09-05 13:41     ` David Malcolm
@ 2018-09-11 10:30       ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-11 10:30 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 190 bytes --]

On 05/09/18 14:41, David Malcolm wrote:
> Please add the:
> 
>    auto_diagnostic_group d;
> 
> line to the top of the function.
> 
> OK with that change.

Here's what I committed.
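
For reference, a typical back-end call site then looks something like
this (the message text is invented for illustration):

  if (DECL_STATIC_CONSTRUCTOR (decl))
    sorry_at (DECL_SOURCE_LOCATION (decl),
              "static constructors are not supported");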

Andrew


[-- Attachment #2: 180911-sorry-at.patch --]
[-- Type: text/x-patch, Size: 1920 bytes --]

Add sorry_at diagnostic function.

The plain "sorry" diagnostic only gives the "current" location, which is
typically the last line of the function or translation unit by time we get to
the back end.

GCN uses "sorry" to report unsupported language features, such as static
constructors, so it's useful to have a "sorry_at" variant.

This patch implements "sorry_at" according to the pattern of the other "at"
variants.

2018-09-11  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* diagnostic-core.h (sorry_at): New prototype.
	* diagnostic.c (sorry_at): New function.

diff --git a/gcc/diagnostic-core.h b/gcc/diagnostic-core.h
index e4ebe00..80ff395 100644
--- a/gcc/diagnostic-core.h
+++ b/gcc/diagnostic-core.h
@@ -96,6 +96,7 @@ extern bool permerror (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern bool permerror (rich_location *, const char *,
 				   ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void sorry (const char *, ...) ATTRIBUTE_GCC_DIAG(1,2);
+extern void sorry_at (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform (location_t, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform (rich_location *, const char *, ...) ATTRIBUTE_GCC_DIAG(2,3);
 extern void inform_n (location_t, unsigned HOST_WIDE_INT, const char *,
diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
index aae0934..8575065 100644
--- a/gcc/diagnostic.c
+++ b/gcc/diagnostic.c
@@ -1443,6 +1443,18 @@ sorry (const char *gmsgid, ...)
   va_end (ap);
 }
 
+/* Same as above, but use location LOC instead of input_location.  */
+void
+sorry_at (location_t loc, const char *gmsgid, ...)
+{
+  auto_diagnostic_group d;
+  va_list ap;
+  va_start (ap, gmsgid);
+  rich_location richloc (line_table, loc);
+  diagnostic_impl (&richloc, -1, gmsgid, &ap, DK_SORRY);
+  va_end (ap);
+}
+
 /* Return true if an error or a "sorry" has been seen.  Various
    processing is disabled after errors.  */
 bool

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 12:44       ` Richard Biener
@ 2018-09-11 14:36         ` Andrew Stubbs
  2018-09-12 14:37           ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-11 14:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On 05/09/18 13:43, Richard Biener wrote:
> No.  You might want to look into the x86 backend if there's maybe more tweaks
> needed when using non-vector mask modes.

I tracked it down to the vector alignment configuration.

Apparently the vectorizer likes to build a "truth" vector, but is 
perfectly happy to put it in a non-vector mode. Unfortunately that 
causes TARGET_VECTOR_ALIGNMENT to be called with the non-vector mode, 
which wasn't handled correctly.
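
The fix amounts to making the hook tolerate that case; roughly like this
(a sketch, not the exact GCN code, and the 128-bit cap is invented for
the example):

  static HOST_WIDE_INT
  gcn_vector_alignment (const_tree type)
  {
    /* A vectorizer "truth" type can arrive here with an integer
       TYPE_MODE, so don't assume VECTOR_MODE_P.  */
    if (!VECTOR_MODE_P (TYPE_MODE (type)))
      return TYPE_ALIGN (type);
    /* Illustrative cap only.  */
    return MIN (tree_to_uhwi (TYPE_SIZE (type)), 128);
  }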

I'm testing to see what happens with the reg_equal and reg_equiv 
conversions, but we might be able to drop this patch.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 14:44       ` Joseph Myers
@ 2018-09-11 16:25         ` Andrew Stubbs
  2018-09-11 16:41           ` Joseph Myers
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-11 16:25 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

On 05/09/18 15:44, Joseph Myers wrote:
> On Wed, 5 Sep 2018, Andrew Stubbs wrote:
> 
>> I suppose someone might choose to compile things on an alternative system for
>> running on a compatible system, in which case we'd want to simply skip this
>> binary.
>>
>> How does one normally do this?
> 
> I'd expect a configure test plus makefile conditionals in the makefile
> fragment.

Is it sufficient to simply exclude it from extra_programs in config.gcc?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-11 16:25         ` Andrew Stubbs
@ 2018-09-11 16:41           ` Joseph Myers
  0 siblings, 0 replies; 187+ messages in thread
From: Joseph Myers @ 2018-09-11 16:41 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Tue, 11 Sep 2018, Andrew Stubbs wrote:

> On 05/09/18 15:44, Joseph Myers wrote:
> > On Wed, 5 Sep 2018, Andrew Stubbs wrote:
> > 
> > > I suppose someone might choose to compile things on an alternative system
> > > for
> > > running on a compatible system, in which case we'd want to simply skip
> > > this
> > > binary.
> > > 
> > > How does one normally do this?
> > 
> > I'd expect a configure test plus makefile conditionals in the makefile
> > fragment.
> 
> Is it sufficient to simply exclude it from extra_programs in config.gcc?

That should work (given the appropriate configure test somewhere for 
availability of -ldl).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
  2018-09-05 12:21   ` Joseph Myers
@ 2018-09-11 22:42   ` Jeff Law
  2018-09-12 11:30     ` Andrew Stubbs
  2018-09-12 15:31   ` Richard Henderson
  2 siblings, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-09-11 22:42 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:48 AM, ams@codesourcery.com wrote:
> 
> GCN has some registers which are special purpose, but not "fixed" because we
> want the register allocator to track their usage and select alternatives that
> use different special registers (e.g. scalar cc vs. vector cc).
> 
> Sometimes this leads the regrename pass to ICE.  Quite how it gets confused is
> not well understood, but considering such registers for renaming is surely not
> useful.
> 
> This patch creates a new macro SPECIAL_REGNO_P which disables regrename.  In
> other words, the register is fixed once allocated.
> 
> 2018-09-05  Kwok Cheung Yeung  <kcy@codesourcery.com>
> 
> 	gcc/
> 	* defaults.h (SPECIAL_REGNO_P): Define to false by default.
> 	* regrename.c (check_new_reg_p): Do not rename to a special register.
> 	(rename_chains): Do not rename special registers.
This feels like you're papering over a problem in regrename and/or the
GCN port..  regrename should be checking the predicate and constraints
when it makes changes.  And I think that you're still allowed to refer
to a fixed register in alternatives.

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 06/25] Remove constant vec_select restriction.
  2018-09-05 11:50 ` [PATCH 06/25] Remove constant vec_select restriction ams
@ 2018-09-11 22:44   ` Jeff Law
  0 siblings, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-11 22:44 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:49 AM, ams@codesourcery.com wrote:
> 
> The vec_select operator is documented to require a const_int for the lane
> selector operand, but GCN has an instruction that can select the lane at
> runtime, so it seems reasonable to remove this restriction.
> 
> This patch simply replaces assertions that the operand is constant with early
> exits from the optimizers.  I think it's reasonable that vec_select with a
> non-constant operand cannot be optimized, yet.
> 
> Also included is the necessary documentation tweak.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 
> 	gcc/
> 	* doc/rtl.texi: Adjust vec_select description.
> 	* simplify-rtx.c (simplify_binary_operation_1): Allow VEC_SELECT to use
> 	non-constant selectors.
OK.  Seems like it could go in now since you're just early returning
rather than asserting -- it shouldn't affect any in-tree port.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-05 11:50 ` [PATCH 09/25] Elide repeated RTL elements ams
@ 2018-09-11 22:46   ` Jeff Law
  2018-09-12  8:47     ` Andrew Stubbs
  2018-09-19 17:25     ` Andrew Stubbs
  0 siblings, 2 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-11 22:46 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:49 AM, ams@codesourcery.com wrote:
> 
> GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
> far more bearable by eliding long sequences of the same element into "repeated"
> messages.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 	    Jan Hubicka  <jh@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
> 
> 	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
> 	the same elements are repeated rather than printing all of them.
Does this need a corresponding change to the RTL front-end so that it
can read the new form?

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 19/25] GCN libgfortran.
  2018-09-05 11:52 ` [PATCH 19/25] GCN libgfortran ams
       [not found]   ` <41281e27-ad85-e50c-8fed-6f4f6f18289c@moene.org>
@ 2018-09-11 22:47   ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-11 22:47 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:51 AM, ams@codesourcery.com wrote:
> 
> This patch contains the GCN port of libgfortran.  We use the minimal
> configuration created for NVPTX.  That's all that's required, besides the
> target-independent bug fixes posted already.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 	    Kwok Cheung Yeung  <kcy@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 	    Tom de Vries  <tom@codesourcery.com>
> 
> 	libgfortran/
> 	* configure.ac: Use minimal mode for amdgcn.
> 	* configure: Regenerate.
This is OK once the core port has been accepted.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-05 11:50 ` [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME ams
@ 2018-09-11 22:56   ` Jeff Law
  2018-09-12 14:43     ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-09-11 22:56 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:48 AM, ams@codesourcery.com wrote:
> 
> The HSA GPU drivers can't cope with binaries that have the same symbol defined
> multiple times, even though the names are not exported.  This happens whenever
> there are file-scope static variables with matching names.  I believe it's also
> an issue with switch tables.
> 
> This is a bug, but outside our control, so we must work around it when multiple
> translation units have the same symbol defined.
> 
> Therefore, we've implemented name mangling via
> TARGET_MANGLE_DECL_ASSEMBLER_NAME, but found some places where the middle-end
> assumes that the decl name matches the name in the source.
> 
> This patch fixes up those cases by falling back to comparing the unmangled
> name, when a lookup fails.
> 
> 2018-09-05  Julian Brown  <julian@codesourcery.com>
> 
> 	gcc/
> 	* cgraphunit.c (handle_alias_pairs): Scan for aliases by DECL_NAME if
> 	decl assembler name doesn't match.
> 
> 	gcc/c-family/
> 	* c-pragma.c (maybe_apply_pending_pragma_weaks): Scan for aliases with
> 	DECL_NAME if decl assembler name doesn't match.
This should be fine.  But please verify there are no regressions on the
x86_64 linux target, particularly for the multi-versioning tests (mv*.c
and mv*.C).

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-11 22:46   ` Jeff Law
@ 2018-09-12  8:47     ` Andrew Stubbs
  2018-09-12 15:14       ` Jeff Law
  2018-09-19 17:25     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12  8:47 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 11/09/18 23:45, Jeff Law wrote:
> Does this need a corresponding change to the RTL front-end so that it
> can read the new form?

There's an RTL front-end? When did that happen... clearly I've not been 
paying attention.

If it's expected that dumps can be fed back in unmodified then yes, it 
needs to recognise the new output.
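
For reference, the elided form in the dumps looks roughly like this
(from memory, so the exact format may differ slightly):

  (const_vector:V64SI [
      (const_int 0 [0]) repeated x64
    ])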

I'll look into it.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-11 22:42   ` Jeff Law
@ 2018-09-12 11:30     ` Andrew Stubbs
  2018-09-13 10:03       ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 11:30 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 11/09/18 23:42, Jeff Law wrote:
> This feels like you're papering over a problem in regrename and/or the
> GCN port..  regrename should be checking the predicate and constraints
> when it makes changes.  And I think that you're still allowed to refer
> to a fixed register in alternatives.

I think you're allowed to use a constraint to match an already-present 
hardreg, fixed or otherwise, but my understanding is that LRA will never 
convert a pseudoreg to a fixed hardreg, no matter what the constraint says.

Just to make sure, I just tried to fix EXEC (the only register matching 
the "e" constraint, and one of the "special" ones), and as expected the 
compiler blows up with "unable to generate reloads for ...".

Anyway, back to the issue of SPECIAL_REGNO_P ...

I've just retested the motivating example that we had, and that no 
longer fails in regrename.  That could be because the problem is fixed, 
or simply that the compiler no longer generates the exact instruction 
sequence that demonstrates the problem.

If I can't reproduce the issue then this macro becomes just a small 
compile-time optimization and we can remove it safely.

I'll report back when I've done more testing.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 14:22   ` Joseph Myers
  2018-09-05 14:35     ` Andrew Stubbs
@ 2018-09-12 13:42     ` Andrew Stubbs
  2018-09-12 15:32       ` Joseph Myers
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 13:42 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1001 bytes --]

On 05/09/18 15:22, Joseph Myers wrote:
> In cases like this with alternative diagnostic messages using ?:, you need
> to mark up each message with G_() so they both get extracted for
> translation by exgettext.
> 
[...]
> 
> This concatenation with a macro won't work with exgettext extracting
> messages for translation.
> 
[...]
> 
> Use %qs (presuming this code is using the generic diagnostic machinery
> that supports it).
> 
> +gcn-run$(exeext): gcn-run.o
> +       +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl
> 
> I'd expect this to fail on non-Unix configurations that don't have -ldl,
> and thus to need appropriate conditionals / configure tests to avoid that
> build failure.

The attached diff from the previous patch should address these issues, I 
hope. If they're OK I'll incorporate the changes into the next version 
of the (much) larger patch when I next post them.

> A new port should add an appropriate entry to contrib/config-list.mk.

I'm still testing this.

Andrew

[-- Attachment #2: 180912-fix-gcn-review-issues.patch --]
[-- Type: text/x-patch, Size: 7726 bytes --]

diff --git a/gcc/config.gcc b/gcc/config.gcc
index d28bee5..3d7aa43 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -1387,7 +1387,13 @@ amdgcn-*-amdhsa)
 	extra_modes=gcn/gcn-modes.def
 	extra_objs="${extra_objs} gcn-tree.o"
 	extra_gcc_objs="driver-gcn.o"
-	extra_programs="${extra_programs} gcn-run\$(exeext)"
+	case "$host" in
+	x86_64*-*-linux-gnu )
+		if test "$ac_res" != no; then
+			extra_programs="${extra_programs} gcn-run\$(exeext)"
+		fi
+		;;
+	esac
 	if test x$enable_as_accelerator = xyes; then
 		extra_programs="${extra_programs} mkoffload\$(exeext)"
 		tm_file="${tm_file} gcn/offload.h"
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index ce03d5b..67cf907 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -48,6 +48,7 @@
 #include "print-rtl.h"
 #include "attribs.h"
 #include "varasm.h"
+#include "intl.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -4614,8 +4615,9 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
 		    OPT_Wopenacc_dims,
 		    (dims[GOMP_DIM_VECTOR]
-		     ? "using vector_length (64), ignoring %d"
-		     : "using vector_length (64), ignoring runtime setting"),
+		     ? G_("using vector_length (64), ignoring %d")
+		     : G_("using vector_length (64), "
+			  "ignoring runtime setting")),
 		    dims[GOMP_DIM_VECTOR]);
       dims[GOMP_DIM_VECTOR] = 1;
       changed = true;
diff --git a/gcc/config/gcn/mkoffload.c b/gcc/config/gcn/mkoffload.c
index 57e0f25..d3b5b96 100644
--- a/gcc/config/gcn/mkoffload.c
+++ b/gcc/config/gcn/mkoffload.c
@@ -489,7 +489,7 @@ main (int argc, char **argv)
 
   char *collect_gcc = getenv ("COLLECT_GCC");
   if (collect_gcc == NULL)
-    fatal_error (input_location, "COLLECT_GCC must be set.");
+    fatal_error (input_location, "COLLECT_GCC must be set");
   const char *gcc_path = dirname (ASTRDUP (collect_gcc));
   const char *gcc_exec = basename (ASTRDUP (collect_gcc));
 
@@ -555,7 +555,7 @@ main (int argc, char **argv)
 	    offload_abi = OFFLOAD_ABI_ILP32;
 	  else
 	    fatal_error (input_location,
-			 "unrecognizable argument of option " STR);
+			 "unrecognizable argument of option %s", argv[i]);
 	}
 #undef STR
       else if (strcmp (argv[i], "-fopenmp") == 0)
@@ -663,11 +663,11 @@ main (int argc, char **argv)
 
   out = fopen (gcn_s2_name, "w");
   if (!out)
-    fatal_error (input_location, "cannot open '%s'", gcn_s2_name);
+    fatal_error (input_location, "cannot open %qs", gcn_s2_name);
 
   cfile = fopen (gcn_cfile_name, "w");
   if (!cfile)
-    fatal_error (input_location, "cannot open '%s'", gcn_cfile_name);
+    fatal_error (input_location, "cannot open %qs", gcn_cfile_name);
 
   process_asm (in, out, cfile);
 
diff --git a/gcc/configure b/gcc/configure
index b7a8e36..4123c2a 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -746,6 +746,7 @@ manext
 LIBICONV_DEP
 LTLIBICONV
 LIBICONV
+DL_LIB
 LDEXP_LIB
 EXTRA_GCC_LIBS
 GNAT_LIBEXC
@@ -9643,6 +9644,69 @@ LDEXP_LIB="$LIBS"
 LIBS="$save_LIBS"
 
 
+# Some systems need dlopen
+save_LIBS="$LIBS"
+LIBS=
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing dlopen" >&5
+$as_echo_n "checking for library containing dlopen... " >&6; }
+if test "${ac_cv_search_dlopen+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char dlopen ();
+int
+main ()
+{
+return dlopen ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' dl; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_cxx_try_link "$LINENO"; then :
+  ac_cv_search_dlopen=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if test "${ac_cv_search_dlopen+set}" = set; then :
+  break
+fi
+done
+if test "${ac_cv_search_dlopen+set}" = set; then :
+
+else
+  ac_cv_search_dlopen=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_dlopen" >&5
+$as_echo "$ac_cv_search_dlopen" >&6; }
+ac_res=$ac_cv_search_dlopen
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
+DL_LIB="$LIBS"
+LIBS="$save_LIBS"
+
+
 # Use <inttypes.h> only if it exists,
 # doesn't clash with <sys/types.h>, declares intmax_t and defines
 # PRId64
@@ -18460,7 +18524,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 18463 "configure"
+#line 18527 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -18566,7 +18630,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 18569 "configure"
+#line 18633 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -19731,20 +19795,20 @@ if test -z "$aix_libpath"; then aix_libpath="/usr/lib:/lib"; fi
 	      prelink_cmds_CXX='tpldir=Template.dir~
 		rm -rf $tpldir~
 		$CC --prelink_objects --instantiation_dir $tpldir $objs $libobjs $compile_deplibs~
-		compile_command="$compile_command `find $tpldir -name \*.o | $NL2SP`"'
+		compile_command="$compile_command `find $tpldir -name \*.o | sort | $NL2SP`"'
 	      old_archive_cmds_CXX='tpldir=Template.dir~
 		rm -rf $tpldir~
 		$CC --prelink_objects --instantiation_dir $tpldir $oldobjs$old_deplibs~
-		$AR $AR_FLAGS $oldlib$oldobjs$old_deplibs `find $tpldir -name \*.o | $NL2SP`~
+		$AR $AR_FLAGS $oldlib$oldobjs$old_deplibs `find $tpldir -name \*.o | sort | $NL2SP`~
 		$RANLIB $oldlib'
 	      archive_cmds_CXX='tpldir=Template.dir~
 		rm -rf $tpldir~
 		$CC --prelink_objects --instantiation_dir $tpldir $predep_objects $libobjs $deplibs $convenience $postdep_objects~
-		$CC -shared $pic_flag $predep_objects $libobjs $deplibs `find $tpldir -name \*.o | $NL2SP` $postdep_objects $compiler_flags ${wl}-soname ${wl}$soname -o $lib'
+		$CC -shared $pic_flag $predep_objects $libobjs $deplibs `find $tpldir -name \*.o | sort | $NL2SP` $postdep_objects $compiler_flags ${wl}-soname ${wl}$soname -o $lib'
 	      archive_expsym_cmds_CXX='tpldir=Template.dir~
 		rm -rf $tpldir~
 		$CC --prelink_objects --instantiation_dir $tpldir $predep_objects $libobjs $deplibs $convenience $postdep_objects~
-		$CC -shared $pic_flag $predep_objects $libobjs $deplibs `find $tpldir -name \*.o | $NL2SP` $postdep_objects $compiler_flags ${wl}-soname ${wl}$soname ${wl}-retain-symbols-file ${wl}$export_symbols -o $lib'
+		$CC -shared $pic_flag $predep_objects $libobjs $deplibs `find $tpldir -name \*.o | sort | $NL2SP` $postdep_objects $compiler_flags ${wl}-soname ${wl}$soname ${wl}-retain-symbols-file ${wl}$export_symbols -o $lib'
 	      ;;
 	    *) # Version 6 and above use weak symbols
 	      archive_cmds_CXX='$CC -shared $pic_flag $predep_objects $libobjs $deplibs $postdep_objects $compiler_flags ${wl}-soname ${wl}$soname -o $lib'
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 65f9c92..d576198 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -1206,6 +1206,14 @@ LDEXP_LIB="$LIBS"
 LIBS="$save_LIBS"
 AC_SUBST(LDEXP_LIB)
 
+# Some systems need dlopen
+save_LIBS="$LIBS"
+LIBS=
+AC_SEARCH_LIBS(dlopen, dl)
+DL_LIB="$LIBS"
+LIBS="$save_LIBS"
+AC_SUBST(DL_LIB)
+
 # Use <inttypes.h> only if it exists,
 # doesn't clash with <sys/types.h>, declares intmax_t and defines
 # PRId64

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 18/25] Fix interleaving of Fortran stop messages
  2018-09-05 18:11     ` Janne Blomqvist
@ 2018-09-12 13:55       ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 13:55 UTC (permalink / raw)
  To: Janne Blomqvist, Toon Moene, GCC Patches; +Cc: Fortran List

On 05/09/18 19:11, Janne Blomqvist wrote:
> Ok, thanks.

Committed, thanks.

Apologies again for missing the Fortran list.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 17/25] Fix Fortran STOP.
  2018-09-05 18:09     ` Janne Blomqvist
@ 2018-09-12 13:56       ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 13:56 UTC (permalink / raw)
  To: Janne Blomqvist, Toon Moene, GCC Patches; +Cc: Fortran List

On 05/09/18 19:09, Janne Blomqvist wrote:
> Ok, thanks.

Committed, thanks.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-11 14:36         ` Andrew Stubbs
@ 2018-09-12 14:37           ` Richard Biener
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-09-12 14:37 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches

On Tue, Sep 11, 2018 at 4:36 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 05/09/18 13:43, Richard Biener wrote:
> > No.  You might want to look into the x86 backend if there's maybe more tweaks
> > needed when using non-vector mask modes.
>
> I tracked it down to the vector alignment configuration.
>
> Apparently the vectorizer likes to build a "truth" vector, but is
> perfectly happy to put it in a non-vector mode. Unfortunately that
> causes TARGET_VECTOR_ALIGNMENT to be called with the non-vector mode,
> which wasn't handled correctly.
>
> I'm testing to see what happens with the reg_equal and reg_equiv
> conversions, but we might be able to drop this patch.

That's good news!

> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-11 22:56   ` Jeff Law
@ 2018-09-12 14:43     ` Richard Biener
  2018-09-12 15:07       ` Jeff Law
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-12 14:43 UTC (permalink / raw)
  To: Jeff Law, Jan Hubicka; +Cc: Stubbs, Andrew, GCC Patches

On Wed, Sep 12, 2018 at 12:56 AM Jeff Law <law@redhat.com> wrote:
>
> On 9/5/18 5:48 AM, ams@codesourcery.com wrote:
> >
> > The HSA GPU drivers can't cope with binaries that have the same symbol defined
> > multiple times, even though the names are not exported.  This happens whenever
> > there are file-scope static variables with matching names.  I believe it's also
> > an issue with switch tables.
> >
> > This is a bug, but outside our control, so we must work around it when multiple
> > translation units have the same symbol defined.
> >
> > Therefore, we've implemented name mangling via
> > TARGET_MANGLE_DECL_ASSEMBLER_NAME, but found some places where the middle-end
> > assumes that the decl name matches the name in the source.
> >
> > This patch fixes up those cases by falling back to comparing the unmangled
> > name, when a lookup fails.
> >
> > 2018-09-05  Julian Brown  <julian@codesourcery.com>
> >
> >       gcc/
> >       * cgraphunit.c (handle_alias_pairs): Scan for aliases by DECL_NAME if
> >       decl assembler name doesn't match.
> >
> >       gcc/c-family/
> >       * c-pragma.c (maybe_apply_pending_pragma_weaks): Scan for aliases with
> >       DECL_NAME if decl assembler name doesn't match.
> This should be fine.  But please verify there's no regressions on the
> x86_64 linux target, particularly for the multi-versioning tests  (mv*.c
> mv*.C

Err - the patch clearly introduces quadraticness into a path which
isn't acceptable.
get_for_asmname works through a hashtable.
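That is, the existing path is a single hash lookup, along the lines of

  symtab_node *node = symtab_node::get_for_asmname (asmname);

while the fallback in the patch walks the whole symbol table comparing
DECL_NAMEs - O(n) per query.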

It also looks like !target can readily happen so I wonder what happens if an
assembler name does not match but a DECL_NAME one does by accident?

I fear you have to fix this one in a different way... (and I hope
Honza agrees with me).

Thanks,
Richard.


> Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-12 14:43     ` Richard Biener
@ 2018-09-12 15:07       ` Jeff Law
  2018-09-12 15:16         ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-09-12 15:07 UTC (permalink / raw)
  To: Richard Biener, Jan Hubicka; +Cc: Stubbs, Andrew, GCC Patches

On 9/12/18 8:42 AM, Richard Biener wrote:
> On Wed, Sep 12, 2018 at 12:56 AM Jeff Law <law@redhat.com> wrote:
>>
>> On 9/5/18 5:48 AM, ams@codesourcery.com wrote:
>>>
>>> The HSA GPU drivers can't cope with binaries that have the same symbol defined
>>> multiple times, even though the names are not exported.  This happens whenever
>>> there are file-scope static variables with matching names.  I believe it's also
>>> an issue with switch tables.
>>>
>>> This is a bug, but outside our control, so we must work around it when multiple
>>> translation units have the same symbol defined.
>>>
>>> Therefore, we've implemented name mangling via
>>> TARGET_MANGLE_DECL_ASSEMBLER_NAME, but found some places where the middle-end
>>> assumes that the decl name matches the name in the source.
>>>
>>> This patch fixes up those cases by falling back to comparing the unmangled
>>> name, when a lookup fails.
>>>
>>> 2018-09-05  Julian Brown  <julian@codesourcery.com>
>>>
>>>       gcc/
>>>       * cgraphunit.c (handle_alias_pairs): Scan for aliases by DECL_NAME if
>>>       decl assembler name doesn't match.
>>>
>>>       gcc/c-family/
>>>       * c-pragma.c (maybe_apply_pending_pragma_weaks): Scan for aliases with
>>>       DECL_NAME if decl assembler name doesn't match.
>> This should be fine.  But please verify there's no regressions on the
>> x86_64 linux target, particularly for the multi-versioning tests  (mv*.c
>> mv*.C
> 
> Err - the patch clearly introduces quadraticness into a path which
> isn't acceptable.
> get_for_asmname works through a hashtable.
But isn't this only being used when we aren't able to find the symbol?
My impression was that should be rare, except for the GCN target.

> 
> It also looks like !target can readily happen so I wonder what happens if an
> assembler name does not match but a DECL_NAME one does by accident?
> 
> I fear you have to fix this one in a different way... (and I hope
> Honza agrees with me).
Honza certainly knows the code better than I.  If  he thinks there's a
performance issue and this needs to be resolved a better way, then I'll
go along with that.

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-12  8:47     ` Andrew Stubbs
@ 2018-09-12 15:14       ` Jeff Law
  0 siblings, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-12 15:14 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/12/18 2:46 AM, Andrew Stubbs wrote:
> On 11/09/18 23:45, Jeff Law wrote:
>> Does this need a corresponding change to the RTL front-end so that it
>> can read the new form?
> 
> There's an RTL front-end? When did that happen... clearly I've not been
> paying attention.
Within the last couple years.  It's primarily for testing purposes so
that we can take RTL, feed it back into a pass and then look at the
results.  There's some tests in the testsuite you can poke at to see how
it works -- they're probably also a good starting point for a test that
we can properly reconstruct RTL with elided elements.


> 
> If it's expected that dumps can be fed back in unmodified then yes, it
> needs to recognise the new output.
I don't think it goes in strictly unmodified, but ISTM we need to be
able to handle this case.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-12 15:07       ` Jeff Law
@ 2018-09-12 15:16         ` Richard Biener
  2018-09-12 16:32           ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-12 15:16 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jan Hubicka, Stubbs, Andrew, GCC Patches

On Wed, Sep 12, 2018 at 5:07 PM Jeff Law <law@redhat.com> wrote:
>
> On 9/12/18 8:42 AM, Richard Biener wrote:
> > On Wed, Sep 12, 2018 at 12:56 AM Jeff Law <law@redhat.com> wrote:
> >>
> >> On 9/5/18 5:48 AM, ams@codesourcery.com wrote:
> >>>
> >>> The HSA GPU drivers can't cope with binaries that have the same symbol defined
> >>> multiple times, even though the names are not exported.  This happens whenever
> >>> there are file-scope static variables with matching names.  I believe it's also
> >>> an issue with switch tables.
> >>>
> >>> This is a bug, but outside our control, so we must work around it when multiple
> >>> translation units have the same symbol defined.
> >>>
> >>> Therefore, we've implemented name mangling via
> >>> TARGET_MANGLE_DECL_ASSEMBLER_NAME, but found some places where the middle-end
> >>> assumes that the decl name matches the name in the source.
> >>>
> >>> This patch fixes up those cases by falling back to comparing the unmangled
> >>> name, when a lookup fails.
> >>>
> >>> 2018-09-05  Julian Brown  <julian@codesourcery.com>
> >>>
> >>>       gcc/
> >>>       * cgraphunit.c (handle_alias_pairs): Scan for aliases by DECL_NAME if
> >>>       decl assembler name doesn't match.
> >>>
> >>>       gcc/c-family/
> >>>       * c-pragma.c (maybe_apply_pending_pragma_weaks): Scan for aliases with
> >>>       DECL_NAME if decl assembler name doesn't match.
> >> This should be fine.  But please verify there's no regressions on the
> >> x86_64 linux target, particularly for the multi-versioning tests  (mv*.c
> >> mv*.C
> >
> > Err - the patch clearly introduces quadraticness into a path which
> > isn't acceptable.
> > get_for_asmname works through a hashtable.
> But isn't this only being rused when we aren't able to find the symbol?

This case seems to happen though.

>  My impression was that should be rare, except for the GCN target.

Still even for the GCN target it looks like a hack given the linear search.

I think it is required to track down the "invalid" uses of DECL_NAME vs.
"mangled" name instead.

> >
> > It also looks like !target can readily happen so I wonder what happens if an
> > assembler name does not match but a DECL_NAME one does by accident?
> >
> > I fear you have to fix this one in a different way... (and I hope
> > Honza agrees with me).
> Honza certainly knows the code better than I.  If  he thinks there's a
> performance issue and this needs to be resolved a better way, then I'll
> go along with that.

I think the symptom GCN sees needs to be better understood - like whether
it is generally OK to mangle things arbitrarily.

Note that TARGET_MANGLE_DECL_ASSEMBLER_NAME might not be
a general symbol mangling hook but may be restricted to symbols with
specific visibility.

Richard.

>
> Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
  2018-09-05 12:21   ` Joseph Myers
  2018-09-11 22:42   ` Jeff Law
@ 2018-09-12 15:31   ` Richard Henderson
  2018-09-12 16:14     ` Andrew Stubbs
  2 siblings, 1 reply; 187+ messages in thread
From: Richard Henderson @ 2018-09-12 15:31 UTC (permalink / raw)
  To: ams, gcc-patches

On 09/05/2018 04:48 AM, ams@codesourcery.com wrote:
> @@ -1198,6 +1198,10 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
>  #define NO_FUNCTION_CSE false
>  #endif
>  
> +#ifndef SPECIAL_REGNO_P
> +#define SPECIAL_REGNO_P(REGNO) false
> +#endif
> +
>  #ifndef HARD_REGNO_RENAME_OK
>  #define HARD_REGNO_RENAME_OK(FROM, TO) true
>  #endif
...

> @@ -320,6 +320,7 @@ check_new_reg_p (int reg ATTRIBUTE_UNUSED, int new_reg,
>      if (TEST_HARD_REG_BIT (this_unavailable, new_reg + i)
>  	|| fixed_regs[new_reg + i]
>  	|| global_regs[new_reg + i]
> +	|| SPECIAL_REGNO_P (new_reg + i)
>  	/* Can't use regs which aren't saved by the prologue.  */
>  	|| (! df_regs_ever_live_p (new_reg + i)
>  	    && ! call_used_regs[new_reg + i])

How is this different from HARD_REGNO_RENAME_OK via the TO argument?
Seems like the hook you're looking for already exists...


r~

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-12 13:42     ` Andrew Stubbs
@ 2018-09-12 15:32       ` Joseph Myers
  2018-09-12 16:46         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Joseph Myers @ 2018-09-12 15:32 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Wed, 12 Sep 2018, Andrew Stubbs wrote:

> > I'd expect this to fail on non-Unix configurations that don't have -ldl,
> > and thus to need appropriate conditionals / configure tests to avoid that
> > build failure.
> 
> The attached diff from the previous patch should address these issues, I hope.
> If they're OK I'll incorporate the changes into the next version of the (much)
> larger patch when I next post them.

> +	case "$host" in
> +	x86_64*-*-linux-gnu )
> +		if test "$ac_res" != no; then
> +			extra_programs="${extra_programs} gcn-run\$(exeext)"
> +		fi

ac_res is a generic autoconf variable used by a lot of tests.  I don't 
think it's at all safe to embed an assumption into the middle of 
config.gcc about which the last test run that sets such a variable is; you 
need a variable explicitly relating to whatever the relevant test is.

What if the host is x86_64 with the x32 ABI?  If the requirement is for 
various types to be the same between the host and GCN, I'd expect that x32 
ABI on the host means it is unsuitable for using gcn-run.  Or are the 
requirements for compatible types between some other two pieces, so that 
an x32 gcn-run is OK?

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-12 15:31   ` Richard Henderson
@ 2018-09-12 16:14     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 16:14 UTC (permalink / raw)
  To: Richard Henderson, gcc-patches

On 12/09/18 16:31, Richard Henderson wrote:
> How is this different from HARD_REGNO_RENAME_OK via the TO argument?
> Seems like the hook you're looking for already exists...

I don't know how we got here (I didn't do the original work), but the 
SPECIAL_REGNO_P was indeed used in HARD_REGNO_RENAME_OK, as well as in 
some extra places in regrename.
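
(Roughly:

  #define HARD_REGNO_RENAME_OK(FROM, TO) \
    (!SPECIAL_REGNO_P (FROM) && !SPECIAL_REGNO_P (TO))

give or take whatever else the port checks there.)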

This definitely was necessary at some point, but I've not yet reproduced 
the issue in the current code-base, so I suspect you may be correct.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-12 15:16         ` Richard Biener
@ 2018-09-12 16:32           ` Andrew Stubbs
  2018-09-12 17:39             ` Julian Brown
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 16:32 UTC (permalink / raw)
  To: Richard Biener, Jeff Law, Julian Brown; +Cc: Jan Hubicka, GCC Patches

On 12/09/18 16:16, Richard Biener wrote:
> I think the symptom GCN sees needs to be better understood - like whether
> it is generally OK to mangle things arbitrarily.

The name mangling is a horrible workaround for a bug in the HSA runtime 
code (which we do not own, cannot fix, and would want to support old 
versions of anyway).  Basically it refuses to load any binary that has 
the same symbol as another, already loaded, binary, regardless of the 
symbol's linkage.  Worse, it also rejects any binary that has duplicate 
symbols within it, despite the fact that it already linked just fine.

Adding the extra lookups is enough to build GCN binaries, with mangled 
names, whereas the existing name mangling support was either more 
specialized or bit rotten (I don't know which).

It may well be that there's a better way to solve the problem, or at 
least to do the lookups.

It may also be that there are some unintended consequences, such as 
false name matches, but I don't know of any at present.

Julian, can you comment, please?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-12 15:32       ` Joseph Myers
@ 2018-09-12 16:46         ` Andrew Stubbs
  2018-09-12 16:50           ` Joseph Myers
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-12 16:46 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

On 12/09/18 16:32, Joseph Myers wrote:
>> +	case "$host" in
>> +	x86_64*-*-linux-gnu )
>> +		if test "$ac_res" != no; then
>> +			extra_programs="${extra_programs} gcn-run\$(exeext)"
>> +		fi
> 
> ac_res is a generic autoconf variable used by a lot of tests.  I don't
> think it's at all safe to embed an assumption into the middle of
> config.gcc about which the last test run that sets such a variable is; you
> need a variable explicitly relating to whatever the relevant test is.

Oops, EWRONGPATCH. That's supposed to be "ac_cv_search_dlopen".

> What if the host is x86_64 with the x32 ABI?  If the requirement is for
> various types to be the same between the host and GCN, I'd expect that x32
> ABI on the host means it is unsuitable for using gcn-run.  Or are the
> requirements for compatible types between some other two pieces, so that
> an x32 gcn-run is OK?

No, x32 would not be ok. The test as is rejects x86_64-*-linux-gnux32, 
so that ought not to be a problem unless somebody has an x32 system with 
the default triplet.

Is that something we really need to care about? If so then I guess a new 
configure test is needed.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-12 16:46         ` Andrew Stubbs
@ 2018-09-12 16:50           ` Joseph Myers
  0 siblings, 0 replies; 187+ messages in thread
From: Joseph Myers @ 2018-09-12 16:50 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Wed, 12 Sep 2018, Andrew Stubbs wrote:

> > What if the host is x86_64 with the x32 ABI?  If the requirement is for
> > various types to be the same between the host and GCN, I'd expect that x32
> > ABI on the host means it is unsuitable for using gcn-run.  Or are the
> > requirements for compatible types between some other two pieces, so that
> > an x32 gcn-run is OK?
> 
> No, x32 would not be ok. The test as is rejects x86_64-*-linux-gnux32, so that
> ought not to be a problem unless somebody has an x32 system with the default
> triplet.

I don't see anything in config.guess that would create such a name 
(config.guess tries to avoid testing $CC as much as possible, and testing 
the ABI used by $CC would be necessary to distinguish x32), so while 
people can use such a triplet to change the default ABI for the target, I 
wouldn't expect it necessarily to be used for the host.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-12 16:32           ` Andrew Stubbs
@ 2018-09-12 17:39             ` Julian Brown
  2018-09-15  6:01               ` Julian Brown
  0 siblings, 1 reply; 187+ messages in thread
From: Julian Brown @ 2018-09-12 17:39 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Richard Biener, Jeff Law, Jan Hubicka, GCC Patches

On Wed, 12 Sep 2018 17:31:58 +0100
Andrew Stubbs <ams@codesourcery.com> wrote:

> On 12/09/18 16:16, Richard Biener wrote:
> > I think the symptom GCN sees needs to be better understood - like
> > whether it is generally OK to mangle things arbitrarily.
> 
> The name mangling is a horrible workaround for a bug in the HSA
> runtime code (which we do not own, cannot fix, and would want to
> support old versions of anyway).  Basically it refuses to load any
> binary that has the same symbol as another, already loaded, binary,
> regardless of the symbol's linkage.  Worse, it also rejects any
> binary that has duplicate symbols within it, despite the fact that it
> already linked just fine.
> 
> Adding the extra lookups is enough to build GCN binaries, with
> mangled names, whereas the existing name mangling support was either
> more specialized or bit rotten (I don't know which).
> 
> It may well be that there's a better way to solve the problem, or at 
> least to do the lookups.
> 
> It may also be that there are some unintended consequences, such as 
> false name matches, but I don't know of any at present.
> 
> Julian, can you comment, please?

I did the local-symbol name mangling in two places:

- The ASM_FORMAT_PRIVATE_NAME macro (good for local statics)
- The TARGET_MANGLE_DECL_ASSEMBLER_NAME hook (for file-scope
  local/statics)

Possibly, this was an abuse of these hooks, but it's arguably wrong
that e.g. handle_alias_pairs has the "assembler name" leak through into
the user's source code -- if it's expected that the hook could make
arbitrary transformations to the string. (The latter hook is only used
by PE code for x86 at present, by the look of it, and the default
handles only special-purpose mangling indicated by placing a '*' at the
front of the symbol.)
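
For the record, the hook implementation boils down to suffixing local
symbols; a sketch, with an illustrative function name and suffix choice
(the real code differs in detail):

  static tree
  gcn_mangle_decl_assembler_name (tree decl, tree id)
  {
    /* Append a per-translation-unit suffix to file-scope statics so the
       HSA runtime never sees two binaries defining the same name.  The
       suffix here is invented for the example.  */
    if (TREE_STATIC (decl) && !TREE_PUBLIC (decl))
      return get_identifier (ACONCAT ((IDENTIFIER_POINTER (id), ".",
                                       main_input_filename, NULL)));
    return id;
  }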

I couldn't find an existing place where the DECL_NAMEs for symbols were
indexed in a hash table, equivalent to the table for assembler names.
Aliases are made via pragmas, so it's not 100% clear to me what the
scoping/lookup rules are supposed to be for those anyway, nor what the
possibility or consequences might be of false matches.

(The "!target" case in maybe_apply_pending_pragma_weaks, if it doesn't
somehow make a false match, just slows down reporting of an error a
little, I think. Similarly in handle_alias_pairs.)

If we had a symtab_node::get_for_name () using a suitable hash table, I
think it'd probably be right to use that. Can that be done (easily), or
is there some equivalent way? Introducing a new hash table everywhere
for a bug workaround for a relatively obscure feature on a single
target seems unfortunate.
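
I.e., hypothetically, the lookups would each become a single query
(get_for_name does not exist today; this is the wished-for API):

  /* Hypothetical, mirroring symtab_node::get_for_asmname.  */
  if (symtab_node *n = symtab_node::get_for_name (DECL_NAME (decl)))
    target = n;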

Thanks,

Julian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-12 11:30     ` Andrew Stubbs
@ 2018-09-13 10:03       ` Andrew Stubbs
  2018-09-13 14:14         ` Andrew Stubbs
  2018-10-04 19:13         ` Jeff Law
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-13 10:03 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 12/09/18 12:29, Andrew Stubbs wrote:
> I'll report back when I've done more testing.

I reproduced the problem, in the latest sources, with the 
SPECIAL_REGNO_P patch removed (and HARD_REGNO_RENAME_OK adjusted 
accordingly).

Testcase: gcc.c-torture/compile/20020706-2.c -O3 -funroll-loops

> during RTL pass: rnreg
> dump file: /scratch/astubbs/amd/upstream/tmp/target.290r.rnreg
> /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/testsuite/gcc.c-torture/compile/20020706-2.c: In function 'crashIt':                                                                                                                     
> /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/testsuite/gcc.c-torture/compile/20020706-2.c:26:1: internal compiler error: in merge_overlapping_regs, at regrename.c:300                                                                
> 26 | }
>    | ^
> 0xef149d merge_overlapping_regs
>         /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/regrename.c:300
> 0xef17cb find_rename_reg(du_head*, reg_class, unsigned long (*) [7], int, bool)
>         /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/regrename.c:373
> 0xef1c84 rename_chains
>         /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/regrename.c:497
> 0xef612b regrename_optimize
>         /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/regrename.c:1951
> 0xef61ae execute
>         /scratch/astubbs/amd/upstream/src/gcc-gcn-master/gcc/regrename.c:1986
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <https://sourcery.mentor.com/GNUToolchain/> for instructions.

The register that find_rename_reg is considering is SCC, which is one of 
the "special" registers.  There is a short-cut in rename_chains for 
fixed registers, global registers, and frame pointers.  It does not 
check HARD_REGNO_RENAME_OK.
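
(The short-cut is roughly, quoting from memory:

  if (fixed_regs[reg] || global_regs[reg]
      || (frame_pointer_needed && reg == HARD_FRAME_POINTER_REGNUM))
    continue;

so SCC falls through to the renaming machinery.)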

Presumably the bug is not that it will actually try to rename SCC, but 
that it trips an assert while trying to compute the other parameter for 
the HARD_REGNO_RENAME_OK hook.

The SPECIAL_REGNO_P macro fixed the issue by extending the short-cut to 
include the additional registers.

The assert is caused because the def-use chains indicate that SCC 
conflicts with itself. I suppose the question is why is it doing that, 
but it's probably do do with that being a special register that gets 
used in split2 (particularly by the addptrdi3 pattern). Although, those 
patterns are careful to save SCC to one side and then restore it again 
after, so I'd have thought the DF analysis would work out?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 10:03       ` Andrew Stubbs
@ 2018-09-13 14:14         ` Andrew Stubbs
  2018-09-13 14:39           ` Paul Koning
  2018-09-17 22:59           ` Jeff Law
  2018-10-04 19:13         ` Jeff Law
  1 sibling, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-13 14:14 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 13/09/18 11:01, Andrew Stubbs wrote:
> The assert is caused because the def-use chains indicate that SCC 
> conflicts with itself. I suppose the question is why is it doing that, 
> but it's probably to do with that being a special register that gets 
> used in split2 (particularly by the addptrdi3 pattern). Although, those 
> patterns are careful to save SCC to one side and then restore it again 
> after, so I'd have thought the DF analysis would work out?

I think I may have a theory on this one now....

The addptrdi3 pattern must use two 32-bit adds with a carry in SCC, but 
addptr patterns are not allowed to clobber SCC, so the splitter 
carefully saves and restores the old value.
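
Schematically the split emits this sequence (pseudo-RTL, operands
elided):

  (set (reg:BI tmp) (reg:BI SCC))           ;; save SCC
  (set (reg:SI lo) (plus:SI ...))           ;; low add, carry out in SCC
  (set (reg:SI hi) (plus:SI (plus:SI ...)
                            (reg:BI SCC)))  ;; high add, carry in
  (set (reg:BI SCC) (reg:BI tmp))           ;; restore SCC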

This is correct at runtime, and looks correct in RTL dumps, but it means 
that there's still a single rtx REG instance holding the live SCC 
register, and it's still live before and after the new add instruction.

Would I be right in thinking that the dataflow analysis doesn't like this?

I think I have a work-around (by using different instructions), but is 
there a correct way to do this if there weren't an alternative?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 14:14         ` Andrew Stubbs
@ 2018-09-13 14:39           ` Paul Koning
  2018-09-13 14:49             ` Andrew Stubbs
  2018-09-17 22:59           ` Jeff Law
  1 sibling, 1 reply; 187+ messages in thread
From: Paul Koning @ 2018-09-13 14:39 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: GCC patches



> On Sep 13, 2018, at 10:08 AM, Andrew Stubbs <ams@codesourcery.com> wrote:
> 
> On 13/09/18 11:01, Andrew Stubbs wrote:
>> The assert is caused because the def-use chains indicate that SCC conflicts with itself. I suppose the question is why is it doing that, but it's probably to do with that being a special register that gets used in split2 (particularly by the addptrdi3 pattern). Although, those patterns are careful to save SCC to one side and then restore it again after, so I'd have thought the DF analysis would work out?
> 
> I think I may have a theory on this one now....
> 
> The addptrdi3 pattern must use two 32-bit adds with a carry in SCC, but addptr patterns are not allowed to clobber SCC, so the splitter carefully saves and restores the old value.

If you don't have machine operations that add without messing with condition codes, wouldn't it make sense to omit the definition of the add-pointer patterns?  GCC will build things out of normal (CC-clobbering) adds if there are no add-pointer operations, which may well be more efficient in most cases than explicitly saving/restoring a CC that may in fact not matter right at that spot.

	paul

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 14:39           ` Paul Koning
@ 2018-09-13 14:49             ` Andrew Stubbs
  2018-09-13 14:58               ` Paul Koning
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-13 14:49 UTC (permalink / raw)
  To: Paul Koning; +Cc: GCC patches

On 13/09/18 15:16, Paul Koning wrote:
> If you don't have machine operations that add without messing with
> condition codes, wouldn't it make sense to omit the definition of the
> add-pointer patterns?  GCC will build things out of normal
> (CC-clobbering) adds if there are no add-pointer operations, which
> may well be more efficient in most cases than explicitly
> saving/restoring a CC that may in fact not matter right at that
> spot.

I thought the whole point of addptr is that it *is* needed when add
clobbers CC? As in, LRA spills are malformed without this.

Did something change? The internals manual still says "It only needs to
be defined if addm3 sets the condition code."

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 14:49             ` Andrew Stubbs
@ 2018-09-13 14:58               ` Paul Koning
  2018-09-13 15:22                 ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Paul Koning @ 2018-09-13 14:58 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: GCC patches



> On Sep 13, 2018, at 10:39 AM, Andrew Stubbs <ams@codesourcery.com> wrote:
> 
> On 13/09/18 15:16, Paul Koning wrote:
>> If you don't have machine operations that add without messing with
>> condition codes, wouldn't it make sense to omit the definition of the
>> add-pointer patterns?  GCC will build things out of normal
>> (CC-clobbering) adds if there are no add-pointer operations, which
>> may well be more efficient in most cases than explicitly
>> saving/restoring a CC that may in fact not matter right at that
>> spot.
> 
> I thought the whole point of addptr is that it *is* needed when add
> clobbers CC? As in, LRA spills are malformed without this.
> 
> Did something change? The internals manual still says "It only needs to
> be defined if addm3 sets the condition code."

It's ambiguous, because the last sentence of that paragraph says "addm3 is used if addptrm3 is not defined."  

I don't know of any change in this area.  All I know is that pdp11 has adds that clobber CC and it doesn't define addptrm3, relying on that last sentence.  I've tried LRA and for the most part it compiles successfully, I suppose I should verify the generated code based on the point you raised.  If I really have to define addptr, I'm in trouble because  save/restore CC is not easy on pdp11.

	paul

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 14:58               ` Paul Koning
@ 2018-09-13 15:22                 ` Andrew Stubbs
  2018-09-13 17:13                   ` Paul Koning
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-13 15:22 UTC (permalink / raw)
  To: Paul Koning; +Cc: GCC patches

On 13/09/18 15:49, Paul Koning wrote:
> It's ambiguous, because the last sentence of that paragraph says "addm3 is used if addptrm3 is not defined."

I didn't read that as ambiguous; I read it as addm3 is assumed to work 
fine when addptr is not defined.

> I don't know of any change in this area.  All I know is that pdp11 has adds that clobber CC and it doesn't define addptrm3, relying on that last sentence.  I've tried LRA and for the most part it compiles successfully, I suppose I should verify the generated code based on the point you raised.  If I really have to define addptr, I'm in trouble because  save/restore CC is not easy on pdp11.

The code was added because we had a number of testcases that failed at 
runtime without it.

Admittedly, that was in a GCC 7 code-base, and I can't reproduce the 
failure with one of those test cases now (with addptr deleted), but 
possibly that's just noise.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 15:22                 ` Andrew Stubbs
@ 2018-09-13 17:13                   ` Paul Koning
  0 siblings, 0 replies; 187+ messages in thread
From: Paul Koning @ 2018-09-13 17:13 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: GCC patches



> On Sep 13, 2018, at 10:58 AM, Andrew Stubbs <ams@codesourcery.com> wrote:
> 
> On 13/09/18 15:49, Paul Koning wrote:
>> It's ambiguous, because the last sentence of that paragraph says "addm3 is used if addptrm3 is not defined."
> 
> I didn't read that as ambiguous; I read it as addm3 is assumed to work fine when addptr is not defined.
> 
>> I don't know of any change in this area.  All I know is that pdp11 has adds that clobber CC and it doesn't define addptrm3, relying on that last sentence.  I've tried LRA and for the most part it compiles successfully; I suppose I should verify the generated code based on the point you raised.  If I really have to define addptr, I'm in trouble because save/restore CC is not easy on pdp11.
> 
> The code was added because we had a number of testcases that failed at runtime without it.
> 
> Admittedly, that was in a GCC 7 code-base, and I can't reproduce the failure with one of those test cases now (with addptr deleted), but possibly that's just noise.

Possibly relevant is that pdp11 is a "type 2" CC setting target, one where the machine description doesn't mention CC until after reload.  So if reload (LRA) is generating adds, the CC effect of that is invisible anyway until later passes that deal with the resulting clobbers and elimination, or not, of compares.

If that's what this is all about, some documentation clarification would help.  Can someone confirm (or refute) my guess?

	paul


^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 01/25] Handle vectors that don't fit in an integer.
  2018-09-05 11:49 ` [PATCH 01/25] Handle vectors that don't fit in an integer ams
  2018-09-05 11:54   ` Jakub Jelinek
@ 2018-09-14 16:03   ` Richard Sandiford
  2018-11-15 17:20     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-14 16:03 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> GCN vector sizes range between 64 and 512 bytes, none of which have
> correspondingly sized integer modes.  This breaks a number of assumptions
> throughout the compiler, but I don't really want to create modes just for this
> purpose.
>
> Instead, this patch fixes up the cases that I've found, so far, such that the
> compiler tries something else, or fails to optimize, rather than just ICE.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>             Kwok Cheung Yeung  <kcy@codesourcery.com>
> 	    Jan Hubicka  <jh@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
>
> 	gcc/
> 	* combine.c (gen_lowpart_or_truncate): Return clobber if there is
> 	not an integer mode of the same size as x.
> 	(gen_lowpart_for_combine): Fail if there is no integer mode of the
> 	same size.
> 	* expr.c (expand_expr_real_1): Force first operand to be in memory
> 	if it is a vector register and the result is in BLKmode.
> 	* tree-vect-stmts.c (vectorizable_store): Don't ICE when
> 	int_mode_for_size fails.
> 	(vectorizable_load): Likewise.
> ---
>  gcc/combine.c         | 13 ++++++++++++-
>  gcc/expr.c            |  8 ++++++++
>  gcc/tree-vect-stmts.c |  8 ++++----
>  3 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index a2649b6..cbf9dae 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -8621,7 +8621,13 @@ gen_lowpart_or_truncate (machine_mode mode, rtx x)
>      {
>        /* Bit-cast X into an integer mode.  */
>        if (!SCALAR_INT_MODE_P (GET_MODE (x)))
> -	x = gen_lowpart (int_mode_for_mode (GET_MODE (x)).require (), x);
> +	{
> +	  enum machine_mode imode =
> +	    int_mode_for_mode (GET_MODE (x)).require ();
> +	  if (imode == BLKmode)
> +	    return gen_rtx_CLOBBER (mode, const0_rtx);
> +	  x = gen_lowpart (imode, x);

require () will ICE if there isn't an integer mode and always returns
a scalar_int_mode, so this looks like a no-op.  I think you want
something like:

    scalar_int_mode imode;
    if (!int_mode_for_mode (GET_MODE (x)).exists (&imode))
      ...
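
Filled out, and combined with the patch's CLOBBER fallback, that might
look something like this (an untested sketch, not the committed fix):

    scalar_int_mode imode;
    if (!int_mode_for_mode (GET_MODE (x)).exists (&imode))
      return gen_rtx_CLOBBER (mode, const0_rtx);
    x = gen_lowpart (imode, x);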

> @@ -11698,6 +11704,11 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
>    if (omode == imode)
>      return x;
>  
> +  /* This can happen when there is no integer mode corresponding
> +     to a size of vector mode.  */
> +  if (omode == BLKmode)
> +    goto fail;
> +
>    /* We can only support MODE being wider than a word if X is a
>       constant integer or has a mode the same size.  */
>    if (maybe_gt (GET_MODE_SIZE (omode), UNITS_PER_WORD)

This seems like it's working around a bug in the caller.

> diff --git a/gcc/expr.c b/gcc/expr.c
> index cd5cf12..776254a 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -10569,6 +10569,14 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>  			  || maybe_gt (bitpos + bitsize,
>  				       GET_MODE_BITSIZE (mode2)));
>  
> +	/* If the result is in BLKmode and the underlying object is a
> +	   vector in a register, and the size of the vector is larger than
> +	   the largest integer mode, then we must force OP0 to be in memory
> +	   as this is assumed in later code.  */
> +	if (REG_P (op0) && VECTOR_MODE_P (mode2) && mode == BLKmode
> +	    && maybe_gt (bitsize, MAX_FIXED_MODE_SIZE))
> +	  must_force_mem = 1;
> +
>  	/* Handle CONCAT first.  */
>  	if (GET_CODE (op0) == CONCAT && !must_force_mem)
>  	  {

Are you sure this is still needed after:

2018-06-04  Richard Sandiford  <richard.sandiford@linaro.org>

	* expr.c (expand_expr_real_1): Force the operand into memory if
	its TYPE_MODE is BLKmode and if there is no integer mode for
	the number of bits being extracted.

If so, what case is it handling differently?

> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 8d94fca..607a2bd 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -6702,12 +6702,12 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  		     supported.  */
>  		  unsigned lsize
>  		    = group_size * GET_MODE_BITSIZE (elmode);
> -		  elmode = int_mode_for_size (lsize, 0).require ();
>  		  unsigned int lnunits = const_nunits / group_size;
>  		  /* If we can't construct such a vector fall back to
>  		     element extracts from the original vector type and
>  		     element size stores.  */
> -		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
> +		  if (int_mode_for_size (lsize, 0).exists (&elmode)
> +		      && mode_for_vector (elmode, lnunits).exists (&vmode)
>  		      && VECTOR_MODE_P (vmode)
>  		      && targetm.vector_mode_supported_p (vmode)
>  		      && (convert_optab_handler (vec_extract_optab,
> @@ -7839,11 +7839,11 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  		     to a larger load.  */
>  		  unsigned lsize
>  		    = group_size * TYPE_PRECISION (TREE_TYPE (vectype));
> -		  elmode = int_mode_for_size (lsize, 0).require ();
>  		  unsigned int lnunits = const_nunits / group_size;
>  		  /* If we can't construct such a vector fall back to
>  		     element loads of the original vector type.  */
> -		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
> +		  if (int_mode_for_size (lsize, 0).exists (&elmode)
> +		      && mode_for_vector (elmode, lnunits).exists (&vmode)
>  		      && VECTOR_MODE_P (vmode)
>  		      && targetm.vector_mode_supported_p (vmode)
>  		      && (convert_optab_handler (vec_init_optab, vmode, elmode)

These two are OK independently of the rest (if that's convenient).

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 24/25] Ignore LLVM's blank lines.
  2018-09-05 11:52 ` [PATCH 24/25] Ignore LLVM's blank lines ams
@ 2018-09-14 16:19   ` Jeff Law
  2020-03-23 15:29     ` Thomas Schwinge
  0 siblings, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-09-14 16:19 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:52 AM, ams@codesourcery.com wrote:
> 
> The GCN toolchain must use the LLVM assembler and linker because there's no
> binutils port.  The LLVM tools do not have the same diagnostic style as
> binutils, so the "blank line(s) in output" tests are inappropriate (and very
> noisy).
> 
> The LLVM tools also have different command line options, so it's not possible
> to autodetect object formats in the same way.
> 
> This patch addresses both issues.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 
> 	gcc/testsuite/
> 	* lib/file-format.exp (gcc_target_object_format): Handle AMD GCN.
> 	* lib/gcc-dg.exp (gcc-dg-prune): Ignore blank lines from the LLVM
> 	linker.
> 	* lib/target-supports.exp (check_effective_target_llvm_binutils): New.
This is fine.  It's a NOP for other targets, so feel free to commit when
it's convenient for you.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 23/25] Testsuite: GCN is always PIE.
  2018-09-05 11:52 ` [PATCH 23/25] Testsuite: GCN is always PIE ams
@ 2018-09-14 16:39   ` Jeff Law
  0 siblings, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-14 16:39 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:52 AM, ams@codesourcery.com wrote:
> 
> The GCN/HSA loader ignores the load address and uses a random location, so we
> build all GCN binaries as PIE, by default.
> 
> This patch makes the necessary testsuite adjustments to make this work
> correctly.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 
> 	gcc/testsuite/
> 	* gcc.dg/graphite/scop-19.c: Check pie_enabled.
> 	* gcc.dg/pic-1.c: Disable on amdgcn.
> 	* gcc.dg/pic-2.c: Disable on amdgcn.
> 	* gcc.dg/pic-3.c: Disable on amdgcn.
> 	* gcc.dg/pic-4.c: Disable on amdgcn.
> 	* gcc.dg/pie-3.c: Disable on amdgcn.
> 	* gcc.dg/pie-4.c: Disable on amdgcn.
> 	* gcc.dg/uninit-19.c: Check pie_enabled.
> 	* lib/target-supports.exp (check_effective_target_pie): Add amdgcn.
OK.  Commit at your leisure.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-12 17:39             ` Julian Brown
@ 2018-09-15  6:01               ` Julian Brown
  2018-09-19 15:23                 ` Julian Brown
  0 siblings, 1 reply; 187+ messages in thread
From: Julian Brown @ 2018-09-15  6:01 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Richard Biener, Jeff Law, Jan Hubicka, GCC Patches

On Wed, 12 Sep 2018 13:34:06 -0400
Julian Brown <julian@codesourcery.com> wrote:

> On Wed, 12 Sep 2018 17:31:58 +0100
> Andrew Stubbs <ams@codesourcery.com> wrote:
> 
> > On 12/09/18 16:16, Richard Biener wrote:  
> > > I think the symptom GCN sees needs to be better understood - like
> > > whether it is generally OK to mangle things arbitrarily.
> > 
> > The name mangling is a horrible workaround for a bug in the HSA
> > runtime code (which we do not own, cannot fix, and would want to
> > support old versions of anyway).  Basically it refuses to load any
> > binary that has the same symbol as another, already loaded, binary,
> > regardless of the symbol's linkage.  Worse, it also rejects any
> > binary that has duplicate symbols within it, despite the fact that
> > it already linked just fine.
> > 
> > Adding the extra lookups is enough to build GCN binaries, with
> > mangled names, whereas the existing name mangling support was either
> > more specialized or bit rotten (I don't know which).
> > 
> > It may well be that there's a better way to solve the problem, or
> > at least to do the lookups.
> > 
> > It may also be that there are some unintended consequences, such as 
> > false name matches, but I don't know of any at present.
> > 
> > Julian, can you comment, please?  
> 
> I did the local-symbol name mangling in two places:
> 
> - The ASM_FORMAT_PRIVATE_NAME macro (good for local statics)
> - The TARGET_MANGLE_DECL_ASSEMBLER_NAME hook (for file-scope
>   local/statics)
> 
> Possibly, this was an abuse of these hooks, but it's arguably wrong
> that e.g. handle_alias_pairs has the "assembler name" leak
> through into the user's source code -- if it's expected that the hook
> could make arbitrary transformations to the string. (The latter hook
> is only used by PE code for x86 at present, by the look of it, and
> the default handles only special-purpose mangling indicated by
> placing a '*' at the front of the symbol.)

One possibility might be to allow
symbol_table::decl_assembler_name_hash and
symbol_table::assembler_names_equal_p to be overridden by a target
hook, and define them for GCN to ignore the symbol "localisation"
magic. At the moment they will just ignore "*" at the start of a
symbol, and a (fixed) user label prefix, no matter what
TARGET_MANGLE_DECL_ASSEMBLER_NAME does.
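
As a purely hypothetical sketch of that idea (the hook name, the
".private." marker and the comparison logic here are all assumptions
for illustration, not existing GCC API):

static bool
gcn_assembler_names_equal_p (const char *name1, const char *name2)
{
  /* Compare only up to an assumed localisation marker, so that
     "foo" and "foo.private.3" count as the same symbol.  */
  const char *end1 = strstr (name1, ".private.");
  const char *end2 = strstr (name2, ".private.");
  size_t len1 = end1 ? (size_t) (end1 - name1) : strlen (name1);
  size_t len2 = end2 ? (size_t) (end2 - name2) : strlen (name2);
  return len1 == len2 && memcmp (name1, name2, len1) == 0;
}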

Another way would be to do some appropriate mangling for local symbols
in the assembler, rather than the compiler (though we're using the LLVM
assembler, and so far have got away with not making any invasive
changes to that).

> If we had a symtab_node::get_for_name () using a suitable hash table,
> I think it'd probably be right to use that. Can that be done
> (easily), or is there some equivalent way? Introducing a new hash
> table everywhere for a bug workaround for a relatively obscure
> feature on a single target seems unfortunate.

An "obvious" solution of calling targetm.mangle_decl_assembler_name
before looking up in symtab_node::get_for_asmname, something like:

static void
handle_alias_pairs (void)
{
  alias_pair *p;
  unsigned i;

  for (i = 0; alias_pairs && alias_pairs->iterate (i, &p);)
    {
      tree asmname = targetm.mangle_decl_assembler_name (p->decl, p->target);
      symtab_node *target_node = symtab_node::get_for_asmname (asmname);
      [...]

seems like it could possibly work for handle_alias_pairs, but not so
much for c-pragma.c:maybe_apply_pending_pragma_weaks, where there is no
decl available to pass as the first argument to the target hook.

Julian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode
  2018-09-05 11:50 ` [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode ams
@ 2018-09-17  8:46   ` Richard Sandiford
  2018-09-26 15:52     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  8:46 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> This is an update of the patch posted to PR82089 long ago.  We ran into the
> same bug on GCN, so we need this fixed as part of this series.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>             Tom de Vries  <tom@codesourcery.com>
>
> 	PR82089
>
> 	gcc/
> 	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
> 	STORE_FLAG_VALUE == 1.
> ---
>  gcc/expmed.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/expmed.c b/gcc/expmed.c
> index 29ce10b..0b87fdc 100644
> --- a/gcc/expmed.c
> +++ b/gcc/expmed.c
> @@ -5464,11 +5464,18 @@ emit_cstore (rtx target, enum insn_code icode, enum rtx_code code,
>       If STORE_FLAG_VALUE does not have the sign bit set when
>       interpreted in MODE, we can do this conversion as unsigned, which
>       is usually more efficient.  */
> -  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode))
> +  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode)
> +      || (result_mode == BImode && int_target_mode != BImode))

Would be better to test GET_MODE_PRECISION rather than GET_MODE_SIZE,
if that works, instead of treating BImode as a special case.

>      {
> -      convert_move (target, subtarget,
> -		    val_signbit_known_clear_p (result_mode,
> -					       STORE_FLAG_VALUE));
> +      gcc_assert (GET_MODE_SIZE (result_mode) != 1
> +		  || STORE_FLAG_VALUE == 1 || STORE_FLAG_VALUE == -1);
> +      bool unsignedp
> +	= (GET_MODE_SIZE (result_mode) == 1
> +	   ? STORE_FLAG_VALUE == 1
> +	   : val_signbit_known_clear_p (result_mode, STORE_FLAG_VALUE));
> +
> +      convert_move (target, subtarget, unsignedp);
> +

GET_MODE_SIZE == 1 would also trigger for QImode, which shouldn't be treated
differently from HImode etc.

The original val_signbit_known_clear_p test seems like it might be an
abstraction too far.  In practice STORE_FLAG_VALUE has to fit within
the mode of a natural (unextended) condition result, so I think we can
simply test STORE_FLAG_VALUE >= 0 for all modes to see whether the target
wants the result to be treated as signed or unsigned.
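
Taken together, the two suggestions would shrink the hunk to something
like this (an untested sketch):

    if (GET_MODE_PRECISION (int_target_mode) > GET_MODE_PRECISION (result_mode))
      convert_move (target, subtarget, STORE_FLAG_VALUE >= 0);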

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 10/25] Convert BImode vectors.
  2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
  2018-09-05 11:56   ` Jakub Jelinek
  2018-09-05 12:05   ` Richard Biener
@ 2018-09-17  8:51   ` Richard Sandiford
  2 siblings, 0 replies; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  8:51 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> GCN uses V64BImode to represent vector masks in the middle-end, and DImode
> bit-masks to represent them in the back-end.  These must be converted at expand
> time and the most convenient way is to simply use a SUBREG.
>
> This works fine except that simplify_subreg needs to be able to convert
> immediates, mostly for REG_EQUAL and REG_EQUIV, and currently does not know how
> to convert vectors to integers where there is more than one element per byte.
>
> This patch implements such conversions for the cases that we need.
>
> I don't know why this is not a problem for other targets that use BImode
> vectors, such as ARM SVE, so it's possible I missed some magic somewhere?

FWIW, SVE never converts predicates to integers: they stay as V..BImode.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-05 11:51 ` [PATCH 11/25] Simplify vec_merge according to the mask ams
@ 2018-09-17  9:08   ` Richard Sandiford
  2018-09-20 15:44     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  9:08 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> This patch was part of the original patch we acquired from Honza and Martin.
>
> It simplifies vector elements that are inactive, according to the mask.
>
> 2018-09-05  Jan Hubicka  <jh@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
>
> 	* simplify-rtx.c (simplify_merge_mask): New function.
> 	(simplify_ternary_operation): Use it, also see if VEC_MERGEs with the
> 	same masks are used in op1 or op2.

Would be good to have self-tests for the new transforms.
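
Something along these lines, perhaps (a hypothetical sketch; the
make_test_reg helper and the mode choice are assumptions modelled on
other selftests, not part of the posted patch):

static void
test_simplify_merge_mask ()
{
  machine_mode mode = V4SImode;
  rtx op0 = make_test_reg (mode);	/* assumed helper */
  rtx op1 = make_test_reg (mode);
  rtx mask = GEN_INT (0x5);
  rtx vm = gen_rtx_VEC_MERGE (mode, op0, op1, mask);
  /* Selecting either arm of a VEC_MERGE under the same mask should
     yield that arm directly.  */
  ASSERT_RTX_EQ (op0, simplify_merge_mask (vm, mask, 0));
  ASSERT_RTX_EQ (op1, simplify_merge_mask (vm, mask, 1));
}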

> +/* X is an operand number OP of VEC_MERGE operation with MASK.

"of a".  Might also be worth mentioning that X can be a nested
operation of a VEC_MERGE with a different mode, although it always
has the same number of elements as MASK.

> +   Try to simplify using knowledge that values outside of MASK

"simplify X"

> +   will not be used.  */
> +
> +rtx
> +simplify_merge_mask (rtx x, rtx mask, int op)
> +{
> +  gcc_assert (VECTOR_MODE_P (GET_MODE (x)));
> +  poly_uint64 nunits = GET_MODE_NUNITS (GET_MODE (x));
> +  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
> +    {
> +      if (!side_effects_p (XEXP (x, 1 - op)))
> +	return XEXP (x, op);
> +    }
> +  if (side_effects_p (x))
> +    return NULL_RTX;
> +  if (UNARY_P (x)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits))

known_eq, since we require equality for correctness.  Same for the
other tests.

> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      if (top0)
> +	return simplify_gen_unary (GET_CODE (x), GET_MODE (x), top0,
> +				   GET_MODE (XEXP (x, 0)));
> +    }
> +  if (BINARY_P (x)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits))
> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
> +      if (top0 || top1)
> +	return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
> +				    top0 ? top0 : XEXP (x, 0),
> +				    top1 ? top1 : XEXP (x, 1));
> +    }
> +  if (GET_RTX_CLASS (GET_CODE (x)) == RTX_TERNARY
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 2)))
> +      && maybe_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 2))), nunits))
> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
> +      rtx top2 = simplify_merge_mask (XEXP (x, 2), mask, op);
> +      if (top0 || top1)
> +	return simplify_gen_ternary (GET_CODE (x), GET_MODE (x),
> +				     GET_MODE (XEXP (x, 0)),
> +				     top0 ? top0 : XEXP (x, 0),
> +				     top1 ? top1 : XEXP (x, 1),
> +				     top2 ? top2 : XEXP (x, 2));
> +    }
> +  return NULL_RTX;
> +}
> +
>  \f
>  /* Simplify CODE, an operation with result mode MODE and three operands,
>     OP0, OP1, and OP2.  OP0_MODE was the mode of OP0 before it became
> @@ -5967,6 +6026,28 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  	  && !side_effects_p (op2) && !side_effects_p (op1))
>  	return op0;
>  
> +      if (!side_effects_p (op2))
> +	{
> +	  rtx top0 = simplify_merge_mask (op0, op2, 0);
> +	  rtx top1 = simplify_merge_mask (op1, op2, 1);
> +	  if (top0 || top1)
> +	    return simplify_gen_ternary (code, mode, mode,
> +					 top0 ? top0 : op0,
> +					 top1 ? top1 : op1, op2);
> +	}
> +
> +      if (GET_CODE (op0) == VEC_MERGE
> +	  && rtx_equal_p (op2, XEXP (op0, 2))
> +	  && !side_effects_p (XEXP (op0, 1)) && !side_effects_p (op2))
> +	return simplify_gen_ternary (code, mode, mode,
> +				     XEXP (op0, 0), op1, op2);
> +
> +      if (GET_CODE (op1) == VEC_MERGE
> +	  && rtx_equal_p (op2, XEXP (op1, 2))
> +	  && !side_effects_p (XEXP (op1, 0)) && !side_effects_p (op2))
> +	return simplify_gen_ternary (code, mode, mode,
> +				     op0, XEXP (op1, 1), op2);

Doesn't simplify_merge_mask make the second two redundant?  I couldn't
see the difference between them and the first condition tested by
simplify_merge_mask.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-05 11:51 ` [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores ams
@ 2018-09-17  9:16   ` Richard Sandiford
  2018-09-17  9:54     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  9:16 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
> blows away the register file and produces horrible code.

Do all the registers really need to be live at once, or is it "just" bad
scheduling?  I'd have expected the initial rtl to load each element and
then insert it immediately, so that the number of insertions doesn't
directly affect register pressure.

> This patch simply disallows elementwise loads for such large vectors.  Is there
> a better way to disable this in the middle-end?

Do you ever want elementwise accesses for GCN?  If not, it might be
better to disable them in the target's cost model.

Thanks,
Richard

>
> 2018-09-05  Julian Brown  <julian@codesourcery.com>
>
> 	gcc/
> 	* tree-vect-stmts.c (get_load_store_type): Don't use VMAT_ELEMENTWISE
> 	loads/stores with many-element (>=64) vectors.
> ---
>  gcc/tree-vect-stmts.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 8875201..a333991 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -2452,6 +2452,26 @@ get_load_store_type (stmt_vec_info stmt_info, tree vectype, bool slp,
>  	*memory_access_type = VMAT_CONTIGUOUS;
>      }
>  
> +  /* FIXME: Element-wise accesses can be extremely expensive if we have a
> +     large number of elements to deal with (e.g. 64 for AMD GCN) using the
> +     current generic code expansion.  Until an efficient code sequence is
> +     supported for affected targets instead, don't attempt vectorization for
> +     VMAT_ELEMENTWISE at all.  */
> +  if (*memory_access_type == VMAT_ELEMENTWISE)
> +    {
> +      poly_uint64 nelements = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +      if (maybe_ge (nelements, 64))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +	      "too many elements (%u) for elementwise accesses\n",
> +	      (unsigned) nelements.to_constant ());
> +
> +	  return false;
> +	}
> +    }
> +
>    if ((*memory_access_type == VMAT_ELEMENTWISE
>         || *memory_access_type == VMAT_STRIDED_SLP)
>        && !nunits.is_constant ())

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 15/25] Don't double-count early-clobber matches.
  2018-09-05 11:51 ` [PATCH 15/25] Don't double-count early-clobber matches ams
@ 2018-09-17  9:22   ` Richard Sandiford
  2018-09-27 22:54     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  9:22 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> Given a pattern with a number of operands:
>
> (match_operand 0 "" "=&v")
> (match_operand 1 "" " v0")
> (match_operand 2 "" " v0")
> (match_operand 3 "" " v0")
>
> GCC will currently increment "reject" once, for operand 0, and then decrement
> it once for each of the other operands, ending with reject == -2 and an
> assertion failure.  If there's a conflict then it might try to decrement reject
> yet again.
>
> Incidentally, what these patterns are trying to achieve is an allocation in
> which operand 0 may match one of the other operands, but may not partially
> overlap any of them.  Ideally there'd be a better way to do this.
>
> In any case, it will affect any pattern in which multiple operands may (or
> must) match an early-clobber operand.
>
> The patch only allows a reject-- when one has not already occurred, for that
> operand.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>
> 	gcc/
> 	* lra-constraints.c (process_alt_operands): Check
> 	matching_early_clobber before decrementing reject, and set
> 	matching_early_clobber after.
> 	* lra-int.h (struct lra_operand_data): Add matching_early_clobber.
> 	* lra.c (setup_operand_alternative): Initialize matching_early_clobber.
> ---
>  gcc/lra-constraints.c | 22 ++++++++++++++--------
>  gcc/lra-int.h         |  3 +++
>  gcc/lra.c             |  1 +
>  3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index 8be4d46..55163f1 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -2202,7 +2202,13 @@ process_alt_operands (int only_alternative)
>  				 "            %d Matching earlyclobber alt:"
>  				 " reject--\n",
>  				 nop);
> -			    reject--;
> +			    if (!curr_static_id->operand[m]
> +						 .matching_early_clobber)
> +			      {
> +				reject--;
> +				curr_static_id->operand[m]
> +						.matching_early_clobber = 1;
> +			      }
>  			  }
>  			/* Otherwise we prefer no matching
>  			   alternatives because it gives more freedom
> @@ -2948,15 +2954,11 @@ process_alt_operands (int only_alternative)
>  	      curr_alt_dont_inherit_ops[curr_alt_dont_inherit_ops_num++]
>  		= last_conflict_j;
>  	      losers++;
> -	      /* Early clobber was already reflected in REJECT. */
> -	      lra_assert (reject > 0);
>  	      if (lra_dump_file != NULL)
>  		fprintf
>  		  (lra_dump_file,
>  		   "            %d Conflict early clobber reload: reject--\n",
>  		   i);
> -	      reject--;
> -	      overall += LRA_LOSER_COST_FACTOR - 1;
>  	    }
>  	  else
>  	    {
> @@ -2980,17 +2982,21 @@ process_alt_operands (int only_alternative)
>  		}
>  	      curr_alt_win[i] = curr_alt_match_win[i] = false;
>  	      losers++;
> -	      /* Early clobber was already reflected in REJECT. */
> -	      lra_assert (reject > 0);
>  	      if (lra_dump_file != NULL)
>  		fprintf
>  		  (lra_dump_file,
>  		   "            %d Matched conflict early clobber reloads: "
>  		   "reject--\n",
>  		   i);
> +	    }
> +	  /* Early clobber was already reflected in REJECT. */
> +	  if (!curr_static_id->operand[i].matching_early_clobber)
> +	    {
> +	      lra_assert (reject > 0);
>  	      reject--;
> -	      overall += LRA_LOSER_COST_FACTOR - 1;
> +	      curr_static_id->operand[i].matching_early_clobber = 1;
>  	    }
> +	  overall += LRA_LOSER_COST_FACTOR - 1;
>  	}
>        if (lra_dump_file != NULL)
>  	fprintf (lra_dump_file, "          alt=%d,overall=%d,losers=%d,rld_nregs=%d\n",

The idea looks good to me FWIW, but you can't use curr_static_id for
the state, since that's a static description of the .md pattern rather
than data about this particular instance.
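
One possible shape (a sketch only, not a worked-out fix) would be to
track the flag locally in process_alt_operands, per instruction
instance rather than per .md pattern:

    /* Cleared at the start of each alternative.  */
    bool matching_early_clobber[MAX_RECOG_OPERANDS];
    memset (matching_early_clobber, 0, sizeof matching_early_clobber);
    ...
    if (!matching_early_clobber[m])
      {
	reject--;
	matching_early_clobber[m] = true;
      }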

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 16/25] Fix IRA ICE.
  2018-09-05 11:51 ` [PATCH 16/25] Fix IRA ICE ams
@ 2018-09-17  9:36   ` Richard Sandiford
  2018-09-18 22:00     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  9:36 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> The IRA pass makes an assumption that any pseudos created after the pass begins
> were created explicitly by the pass itself and therefore will have
> corresponding entries in its other tables.
>
> The GCN back-end, however, often creates additional pseudos, in expand
> patterns, to represent the necessary EXEC value, and these break IRA's
> assumption and cause ICEs.
>
> This patch simply has IRA skip unknown pseudos, and the problem goes away.
>
> Presumably, it's not ideal that these registers have not been processed by IRA,
> but it does not appear to do any real harm.

Could you go into more detail about how this happens?  Other targets
also create pseudos in their move patterns.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 22/25] Add dg-require-effective-target exceptions
  2018-09-05 11:52 ` [PATCH 22/25] Add dg-require-effective-target exceptions ams
@ 2018-09-17  9:40   ` Richard Sandiford
  2018-09-17 17:53   ` Mike Stump
  1 sibling, 0 replies; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17  9:40 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> There are a number of tests that fail because they assume that exceptions are
> available, but GCN does not support them, yet.
>
> This patch adds "dg-require-effective-target exceptions" in all the affected
> tests.  There's probably an automatic way to test for exceptions, but the
> current implementation simply says that AMD GCN does not support them.  This
> should ensure that no other targets are affected by the change.

Manual markup seems fine as long as it's agreed that maintainers of
affected targets are the ones responsible for keeping the markup up
to date (under the obvious rule of course).  There are so many target
selectors that it's hard to remember which options require explicit
tests and which don't, so I don't think the onus should be on every
developer adding a new exception-related test.

The new selector needs an entry in doc/sourcebuild.texi.
OK with that change, thanks.

Richard

>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 	    Kwok Cheung Yeung  <kcy@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 	    Tom de Vries  <tom@codesourcery.com>
>
> 	gcc/testsuite/
> 	* c-c++-common/ubsan/pr71512-1.c: Require exceptions.
> 	* c-c++-common/ubsan/pr71512-2.c: Require exceptions.
> 	* gcc.c-torture/compile/pr34648.c: Require exceptions.
> 	* gcc.c-torture/compile/pr41469.c: Require exceptions.
> 	* gcc.dg/20111216-1.c: Require exceptions.
> 	* gcc.dg/cleanup-10.c: Require exceptions.
> 	* gcc.dg/cleanup-11.c: Require exceptions.
> 	* gcc.dg/cleanup-12.c: Require exceptions.
> 	* gcc.dg/cleanup-13.c: Require exceptions.
> 	* gcc.dg/cleanup-5.c: Require exceptions.
> 	* gcc.dg/cleanup-8.c: Require exceptions.
> 	* gcc.dg/cleanup-9.c: Require exceptions.
> 	* gcc.dg/gomp/pr29955.c: Require exceptions.
> 	* gcc.dg/lto/pr52097_0.c: Require exceptions.
> 	* gcc.dg/nested-func-5.c: Require exceptions.
> 	* gcc.dg/pch/except-1.c: Require exceptions.
> 	* gcc.dg/pch/valid-2.c: Require exceptions.
> 	* gcc.dg/pr41470.c: Require exceptions.
> 	* gcc.dg/pr42427.c: Require exceptions.
> 	* gcc.dg/pr44545.c: Require exceptions.
> 	* gcc.dg/pr47086.c: Require exceptions.
> 	* gcc.dg/pr51481.c: Require exceptions.
> 	* gcc.dg/pr51644.c: Require exceptions.
> 	* gcc.dg/pr52046.c: Require exceptions.
> 	* gcc.dg/pr54669.c: Require exceptions.
> 	* gcc.dg/pr56424.c: Require exceptions.
> 	* gcc.dg/pr64465.c: Require exceptions.
> 	* gcc.dg/pr65802.c: Require exceptions.
> 	* gcc.dg/pr67563.c: Require exceptions.
> 	* gcc.dg/tree-ssa/pr41469-1.c: Require exceptions.
> 	* gcc.dg/tree-ssa/ssa-dse-28.c: Require exceptions.
> 	* gcc.dg/vect/pr46663.c: Require exceptions.
> 	* lib/target-supports.exp (check_effective_target_exceptions): New.
> ---
>  gcc/testsuite/c-c++-common/ubsan/pr71512-1.c  |  1 +
>  gcc/testsuite/c-c++-common/ubsan/pr71512-2.c  |  1 +
>  gcc/testsuite/gcc.c-torture/compile/pr34648.c |  1 +
>  gcc/testsuite/gcc.c-torture/compile/pr41469.c |  1 +
>  gcc/testsuite/gcc.dg/20111216-1.c             |  1 +
>  gcc/testsuite/gcc.dg/cleanup-10.c             |  1 +
>  gcc/testsuite/gcc.dg/cleanup-11.c             |  1 +
>  gcc/testsuite/gcc.dg/cleanup-12.c             |  1 +
>  gcc/testsuite/gcc.dg/cleanup-13.c             |  1 +
>  gcc/testsuite/gcc.dg/cleanup-5.c              |  1 +
>  gcc/testsuite/gcc.dg/cleanup-8.c              |  1 +
>  gcc/testsuite/gcc.dg/cleanup-9.c              |  1 +
>  gcc/testsuite/gcc.dg/gomp/pr29955.c           |  1 +
>  gcc/testsuite/gcc.dg/lto/pr52097_0.c          |  1 +
>  gcc/testsuite/gcc.dg/nested-func-5.c          |  1 +
>  gcc/testsuite/gcc.dg/pch/except-1.c           |  1 +
>  gcc/testsuite/gcc.dg/pch/valid-2.c            |  2 +-
>  gcc/testsuite/gcc.dg/pr41470.c                |  1 +
>  gcc/testsuite/gcc.dg/pr42427.c                |  1 +
>  gcc/testsuite/gcc.dg/pr44545.c                |  1 +
>  gcc/testsuite/gcc.dg/pr47086.c                |  1 +
>  gcc/testsuite/gcc.dg/pr51481.c                |  1 +
>  gcc/testsuite/gcc.dg/pr51644.c                |  1 +
>  gcc/testsuite/gcc.dg/pr52046.c                |  1 +
>  gcc/testsuite/gcc.dg/pr54669.c                |  1 +
>  gcc/testsuite/gcc.dg/pr56424.c                |  1 +
>  gcc/testsuite/gcc.dg/pr64465.c                |  1 +
>  gcc/testsuite/gcc.dg/pr65802.c                |  1 +
>  gcc/testsuite/gcc.dg/pr67563.c                |  1 +
>  gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c     |  1 +
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c    |  1 +
>  gcc/testsuite/gcc.dg/vect/pr46663.c           |  1 +
>  gcc/testsuite/lib/target-supports.exp         | 10 ++++++++++
>  33 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c b/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
> index 2a90ab1..8af9365 100644
> --- a/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
> +++ b/gcc/testsuite/c-c++-common/ubsan/pr71512-1.c
> @@ -1,5 +1,6 @@
>  /* PR c/71512 */
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fnon-call-exceptions -ftrapv -fexceptions -fsanitize=undefined" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  #include "../../gcc.dg/pr44545.c"
> diff --git a/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c b/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
> index 1c95593..0c16934 100644
> --- a/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
> +++ b/gcc/testsuite/c-c++-common/ubsan/pr71512-2.c
> @@ -1,5 +1,6 @@
>  /* PR c/71512 */
>  /* { dg-do compile } */
>  /* { dg-options "-O -fexceptions -fnon-call-exceptions -ftrapv -fsanitize=undefined" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  #include "../../gcc.dg/pr47086.c"
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr34648.c b/gcc/testsuite/gcc.c-torture/compile/pr34648.c
> index 8bcdae0..90a88b9 100644
> --- a/gcc/testsuite/gcc.c-torture/compile/pr34648.c
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr34648.c
> @@ -1,6 +1,7 @@
>  /* PR tree-optimization/34648 */
>  
>  /* { dg-options "-fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern const unsigned short int **bar (void) __attribute__ ((const));
>  const char *a;
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr41469.c b/gcc/testsuite/gcc.c-torture/compile/pr41469.c
> index 5917794..923bca2 100644
> --- a/gcc/testsuite/gcc.c-torture/compile/pr41469.c
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr41469.c
> @@ -1,5 +1,6 @@
>  /* { dg-options "-fexceptions" } */
>  /* { dg-skip-if "requires alloca" { ! alloca } { "-O0" } { "" } } */
> +/* { dg-require-effective-target exceptions } */
>  
>  void
>  af (void *a)
> diff --git a/gcc/testsuite/gcc.dg/20111216-1.c b/gcc/testsuite/gcc.dg/20111216-1.c
> index cd82cf9..7f9395e 100644
> --- a/gcc/testsuite/gcc.dg/20111216-1.c
> +++ b/gcc/testsuite/gcc.dg/20111216-1.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O -fexceptions -fnon-call-exceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern void f2 () __attribute__ ((noreturn));
>  void
> diff --git a/gcc/testsuite/gcc.dg/cleanup-10.c b/gcc/testsuite/gcc.dg/cleanup-10.c
> index 16035b1..1af63ea 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-10.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-10.c
> @@ -1,5 +1,6 @@
>  /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
>  /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify that cleanups work with exception handling through signal frames
>     on alternate stack.  */
>  
> diff --git a/gcc/testsuite/gcc.dg/cleanup-11.c b/gcc/testsuite/gcc.dg/cleanup-11.c
> index ccc61ed..c1f19fe 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-11.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-11.c
> @@ -1,5 +1,6 @@
>  /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
>  /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify that cleanups work with exception handling through realtime signal
>     frames on alternate stack.  */
>  
> diff --git a/gcc/testsuite/gcc.dg/cleanup-12.c b/gcc/testsuite/gcc.dg/cleanup-12.c
> index efb9a58..2171e35 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-12.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-12.c
> @@ -4,6 +4,7 @@
>  /* { dg-options "-O2 -fexceptions" } */
>  /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
>  /* { dg-skip-if "" { ! nonlocal_goto } } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify unwind info in presence of alloca.  */
>  
>  #include <unwind.h>
> diff --git a/gcc/testsuite/gcc.dg/cleanup-13.c b/gcc/testsuite/gcc.dg/cleanup-13.c
> index 8a8db27..1b7ea5c 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-13.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-13.c
> @@ -3,6 +3,7 @@
>  /* { dg-options "-fexceptions" } */
>  /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
>  /* { dg-skip-if "" { ! nonlocal_goto } } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify DW_OP_* handling in the unwinder.  */
>  
>  #include <unwind.h>
> diff --git a/gcc/testsuite/gcc.dg/cleanup-5.c b/gcc/testsuite/gcc.dg/cleanup-5.c
> index 4257f9e..9ed2a7c 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-5.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-5.c
> @@ -3,6 +3,7 @@
>  /* { dg-options "-fexceptions" } */
>  /* { dg-skip-if "" { "ia64-*-hpux11.*" } } */
>  /* { dg-skip-if "" { ! nonlocal_goto } } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify that cleanups work with exception handling.  */
>  
>  #include <unwind.h>
> diff --git a/gcc/testsuite/gcc.dg/cleanup-8.c b/gcc/testsuite/gcc.dg/cleanup-8.c
> index 553c038..45abdb2 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-8.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-8.c
> @@ -1,5 +1,6 @@
>  /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
>  /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify that cleanups work with exception handling through signal
>     frames.  */
>  
> diff --git a/gcc/testsuite/gcc.dg/cleanup-9.c b/gcc/testsuite/gcc.dg/cleanup-9.c
> index fe28072..98dc268 100644
> --- a/gcc/testsuite/gcc.dg/cleanup-9.c
> +++ b/gcc/testsuite/gcc.dg/cleanup-9.c
> @@ -1,5 +1,6 @@
>  /* { dg-do run { target hppa*-*-hpux* *-*-linux* *-*-gnu* powerpc*-*-darwin* *-*-darwin[912]* } } */
>  /* { dg-options "-fexceptions -fnon-call-exceptions -O2" } */
> +/* { dg-require-effective-target exceptions } */
>  /* Verify that cleanups work with exception handling through realtime
>     signal frames.  */
>  
> diff --git a/gcc/testsuite/gcc.dg/gomp/pr29955.c b/gcc/testsuite/gcc.dg/gomp/pr29955.c
> index e49c11c..102898c 100644
> --- a/gcc/testsuite/gcc.dg/gomp/pr29955.c
> +++ b/gcc/testsuite/gcc.dg/gomp/pr29955.c
> @@ -1,6 +1,7 @@
>  /* PR c/29955 */
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fopenmp -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern void bar (int);
>  
> diff --git a/gcc/testsuite/gcc.dg/lto/pr52097_0.c b/gcc/testsuite/gcc.dg/lto/pr52097_0.c
> index cd4af5d..1b3fda3 100644
> --- a/gcc/testsuite/gcc.dg/lto/pr52097_0.c
> +++ b/gcc/testsuite/gcc.dg/lto/pr52097_0.c
> @@ -1,5 +1,6 @@
>  /* { dg-lto-do link } */
>  /* { dg-lto-options { { -O -flto -fexceptions -fnon-call-exceptions --param allow-store-data-races=0 } } } */
> +/* { dg-require-effective-target exceptions } */
>  
>  typedef struct { unsigned int e0 : 16; } s1;
>  typedef struct { unsigned int e0 : 16; } s2;
> diff --git a/gcc/testsuite/gcc.dg/nested-func-5.c b/gcc/testsuite/gcc.dg/nested-func-5.c
> index 3545f37..591f8a2 100644
> --- a/gcc/testsuite/gcc.dg/nested-func-5.c
> +++ b/gcc/testsuite/gcc.dg/nested-func-5.c
> @@ -2,6 +2,7 @@
>  /* { dg-options "-fexceptions" } */
>  /* PR28516: ICE generating ARM unwind directives for nested functions.  */
>  /* { dg-require-effective-target trampolines } */
> +/* { dg-require-effective-target exceptions } */
>  
>  void ex(int (*)(void));
>  void foo(int i)
> diff --git a/gcc/testsuite/gcc.dg/pch/except-1.c b/gcc/testsuite/gcc.dg/pch/except-1.c
> index f81b098..30350ed 100644
> --- a/gcc/testsuite/gcc.dg/pch/except-1.c
> +++ b/gcc/testsuite/gcc.dg/pch/except-1.c
> @@ -1,4 +1,5 @@
>  /* { dg-options "-fexceptions -I." } */
> +/* { dg-require-effective-target exceptions } */
>  #include "except-1.h"
>  
>  int main(void) 
> diff --git a/gcc/testsuite/gcc.dg/pch/valid-2.c b/gcc/testsuite/gcc.dg/pch/valid-2.c
> index 3d8cb14..15a57c9 100644
> --- a/gcc/testsuite/gcc.dg/pch/valid-2.c
> +++ b/gcc/testsuite/gcc.dg/pch/valid-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-options "-I. -Winvalid-pch -fexceptions" } */
> -
> +/* { dg-require-effective-target exceptions } */
>  #include "valid-2.h" /* { dg-warning "settings for -fexceptions do not match" } */
>  /* { dg-error "No such file" "no such file" { target *-*-* } 0 } */
>  /* { dg-error "they were invalid" "invalid files" { target *-*-* } 0 } */
> diff --git a/gcc/testsuite/gcc.dg/pr41470.c b/gcc/testsuite/gcc.dg/pr41470.c
> index 7ef0086..7374fac 100644
> --- a/gcc/testsuite/gcc.dg/pr41470.c
> +++ b/gcc/testsuite/gcc.dg/pr41470.c
> @@ -1,6 +1,7 @@
>  /* { dg-do compile } */
>  /* { dg-options "-fexceptions" } */
>  /* { dg-require-effective-target alloca } */
> +/* { dg-require-effective-target exceptions } */
>  
>  void cf (void *);
>  
> diff --git a/gcc/testsuite/gcc.dg/pr42427.c b/gcc/testsuite/gcc.dg/pr42427.c
> index cb43dd2..cb290fe 100644
> --- a/gcc/testsuite/gcc.dg/pr42427.c
> +++ b/gcc/testsuite/gcc.dg/pr42427.c
> @@ -2,6 +2,7 @@
>  /* { dg-options "-O2 -fexceptions -fnon-call-exceptions -fpeel-loops" } */
>  /* { dg-add-options c99_runtime } */
>  /* { dg-require-effective-target ilp32 } */
> +/* { dg-require-effective-target exceptions } */
>  
>  #include <complex.h>
>  
> diff --git a/gcc/testsuite/gcc.dg/pr44545.c b/gcc/testsuite/gcc.dg/pr44545.c
> index 8058261..37f75f1 100644
> --- a/gcc/testsuite/gcc.dg/pr44545.c
> +++ b/gcc/testsuite/gcc.dg/pr44545.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fnon-call-exceptions -ftrapv -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  void
>  DrawChunk(int *tabSize, int x) 
>  {
> diff --git a/gcc/testsuite/gcc.dg/pr47086.c b/gcc/testsuite/gcc.dg/pr47086.c
> index 71743fe..473e802 100644
> --- a/gcc/testsuite/gcc.dg/pr47086.c
> +++ b/gcc/testsuite/gcc.dg/pr47086.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O -fexceptions -fnon-call-exceptions -ftrapv" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  void
>  foo ()
> diff --git a/gcc/testsuite/gcc.dg/pr51481.c b/gcc/testsuite/gcc.dg/pr51481.c
> index d883d47..a35f8f3 100644
> --- a/gcc/testsuite/gcc.dg/pr51481.c
> +++ b/gcc/testsuite/gcc.dg/pr51481.c
> @@ -1,6 +1,7 @@
>  /* PR tree-optimization/51481 */
>  /* { dg-do compile } */
>  /* { dg-options "-O -fexceptions -fipa-cp -fipa-cp-clone" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern const unsigned short int **foo (void)
>    __attribute__ ((__nothrow__, __const__));
> diff --git a/gcc/testsuite/gcc.dg/pr51644.c b/gcc/testsuite/gcc.dg/pr51644.c
> index 2038a0c..e23c02f 100644
> --- a/gcc/testsuite/gcc.dg/pr51644.c
> +++ b/gcc/testsuite/gcc.dg/pr51644.c
> @@ -1,6 +1,7 @@
>  /* PR middle-end/51644 */
>  /* { dg-do compile } */
>  /* { dg-options "-Wall -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  #include <stdarg.h>
>  
> diff --git a/gcc/testsuite/gcc.dg/pr52046.c b/gcc/testsuite/gcc.dg/pr52046.c
> index e72061f..f0873e2 100644
> --- a/gcc/testsuite/gcc.dg/pr52046.c
> +++ b/gcc/testsuite/gcc.dg/pr52046.c
> @@ -1,6 +1,7 @@
>  /* PR tree-optimization/52046 */
>  /* { dg-do compile } */
>  /* { dg-options "-O3 -fexceptions -fnon-call-exceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern float a[], b[], c[], d[];
>  extern int k[];
> diff --git a/gcc/testsuite/gcc.dg/pr54669.c b/gcc/testsuite/gcc.dg/pr54669.c
> index b68c047..48967ed 100644
> --- a/gcc/testsuite/gcc.dg/pr54669.c
> +++ b/gcc/testsuite/gcc.dg/pr54669.c
> @@ -3,6 +3,7 @@
>  
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fexceptions -fnon-call-exceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  int a[10];
>  
> diff --git a/gcc/testsuite/gcc.dg/pr56424.c b/gcc/testsuite/gcc.dg/pr56424.c
> index a724c64..7f28f04 100644
> --- a/gcc/testsuite/gcc.dg/pr56424.c
> +++ b/gcc/testsuite/gcc.dg/pr56424.c
> @@ -2,6 +2,7 @@
>  
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fexceptions -fnon-call-exceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern long double cosl (long double);
>  extern long double sinl (long double);
> diff --git a/gcc/testsuite/gcc.dg/pr64465.c b/gcc/testsuite/gcc.dg/pr64465.c
> index acfa952..d1d1749 100644
> --- a/gcc/testsuite/gcc.dg/pr64465.c
> +++ b/gcc/testsuite/gcc.dg/pr64465.c
> @@ -1,6 +1,7 @@
>  /* PR tree-optimization/64465 */
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  extern int foo (int *);
>  extern int bar (int, int);
> diff --git a/gcc/testsuite/gcc.dg/pr65802.c b/gcc/testsuite/gcc.dg/pr65802.c
> index fcec234..0721ca8 100644
> --- a/gcc/testsuite/gcc.dg/pr65802.c
> +++ b/gcc/testsuite/gcc.dg/pr65802.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O0 -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  #include <stdarg.h>
>  
> diff --git a/gcc/testsuite/gcc.dg/pr67563.c b/gcc/testsuite/gcc.dg/pr67563.c
> index 34a78a2..5a727b8 100644
> --- a/gcc/testsuite/gcc.dg/pr67563.c
> +++ b/gcc/testsuite/gcc.dg/pr67563.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  static void
>  emit_package (int p1)
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
> index 6be7cd9..eb8e1f2 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr41469-1.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fexceptions -fdump-tree-optimized" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  void af (void *a);
>  
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
> index d35377b..d3a1bbc 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fdump-tree-dse-details -fexceptions -fnon-call-exceptions -fno-isolate-erroneous-paths-dereference" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  
>  int foo (int *p, int b)
> diff --git a/gcc/testsuite/gcc.dg/vect/pr46663.c b/gcc/testsuite/gcc.dg/vect/pr46663.c
> index 457ceae..c2e56bb 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr46663.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr46663.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-additional-options "-O -fexceptions" } */
> +/* { dg-require-effective-target exceptions } */
>  
>  typedef __attribute__ ((const)) int (*bart) (void);
>  
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> index b51e8f0..e27bed0 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8826,6 +8826,16 @@ proc check_effective_target_fenv_exceptions {} {
>      } [add_options_for_ieee "-std=gnu99"]]
>  }
>  
> +# Return 1 if -fexceptions is supported.
> +
> +proc check_effective_target_exceptions {} {
> +    if { [istarget amdgcn*-*-*] } {
> +	return 0
> +    }
> +    return 1
> +}
> +
> +
>  proc check_effective_target_tiny {} {
>      global et_target_tiny_saved
>  

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-17  9:16   ` Richard Sandiford
@ 2018-09-17  9:54     ` Andrew Stubbs
  2018-09-17 12:40       ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-17  9:54 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 17/09/18 10:14, Richard Sandiford wrote:
> <ams@codesourcery.com> writes:
>> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
>> blows away the register file and produces horrible code.
> 
> Do all the registers really need to be live at once, or is it "just" bad
> scheduling?  I'd have expected the initial rtl to load each element and
> then insert it immediately, so that the number of insertions doesn't
> directly affect register pressure.

They don't need to be live at once, architecturally speaking, but that's 
the way it happened.  No doubt there is another solution to fix it, but 
it's not a use case I believe we want to spend time optimizing.

Actually, I've not tested what happens without this in GCC 9, so that's 
probably worth checking, but I'd still be concerned about it blowing up 
on real code somewhere.

>> This patch simply disallows elementwise loads for such large vectors.  Is there
>> a better way to disable this in the middle-end?
> 
> Do you ever want elementwise accesses for GCN?  If not, it might be
> better to disable them in the target's cost model.

The hardware is perfectly capable of extracting or setting vector 
elements, but given that it can do full gather/scatter from arbitrary 
addresses it's not something we want to do in general.

A normal scalar load will use a vector register (lane 0). The value then 
has to be moved to a scalar register, and only then can v_writelane 
insert it into the final destination.

Alternatively you could use a mask_load to load the value directly to 
the correct lane, but I don't believe that's something GCC does.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-17  9:54     ` Andrew Stubbs
@ 2018-09-17 12:40       ` Richard Sandiford
  2018-09-17 12:46         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17 12:40 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 10:14, Richard Sandiford wrote:
>> <ams@codesourcery.com> writes:
>>> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
>>> blows away the register file and produces horrible code.
>> 
>> Do all the registers really need to be live at once, or is it "just" bad
>> scheduling?  I'd have expected the initial rtl to load each element and
>> then insert it immediately, so that the number of insertions doesn't
>> directly affect register pressure.
>
> They don't need to be live at once, architecturally speaking, but that's 
> the way it happened.  No doubt there is another solution to fix it, but 
> it's not a use case I believe we want to spend time optimizing.
>
> Actually, I've not tested what happens without this in GCC 9, so that's 
> probably worth checking, but I'd still be concerned about it blowing up 
> on real code somewhere.
>
>>> This patch simply disallows elementwise loads for such large vectors.
>>> Is there
>>> a better way to disable this in the middle-end?
>> 
>> Do you ever want elementwise accesses for GCN?  If not, it might be
>> better to disable them in the target's cost model.
>
> The hardware is perfectly capable of extracting or setting vector 
> elements, but given that it can do full gather/scatter from arbitrary 
> addresses it's not something we want to do in general.
>
> A normal scalar load will use a vector register (lane 0). The value then 
> has to be moved to a scalar register, and only then can v_writelane 
> insert it into the final destination.

OK, sounds like the cost of vec_construct is too low then.  But looking
at the port, I see you have:

/* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */

int
gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
			tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
{
  /* Always vectorize.  */
  return 1;
}

which short-circuits the cost-model altogether.  Isn't that part
of the problem?

Richard

>
> Alternatively you could use a mask_load to load the value directly to 
> the correct lane, but I don't believe that's something GCC does.
>
> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-17 12:40       ` Richard Sandiford
@ 2018-09-17 12:46         ` Andrew Stubbs
  2018-09-20 13:01           ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-17 12:46 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 17/09/18 12:43, Richard Sandiford wrote:
> OK, sounds like the cost of vec_construct is too low then.  But looking
> at the port, I see you have:
> 
> /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> 
> int
> gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> 			tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> {
>    /* Always vectorize.  */
>    return 1;
> }
> 
> which short-circuits the cost-model altogether.  Isn't that part
> of the problem?

Well, it's possible that that's a little simplistic. ;-)

Although, actually the elementwise issue predates the existence of 
gcn_vectorization_cost, and the default does appear to penalize 
vec_construct somewhat.

Actually, the default definition doesn't seem to do much besides 
increase vec_construct, so I'm not sure now why I needed to change it? 
Hmm, more experiments to do.

Thanks for the pointer.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 22/25] Add dg-require-effective-target exceptions
  2018-09-05 11:52 ` [PATCH 22/25] Add dg-require-effective-target exceptions ams
  2018-09-17  9:40   ` Richard Sandiford
@ 2018-09-17 17:53   ` Mike Stump
  2018-09-20 16:10     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Mike Stump @ 2018-09-17 17:53 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Sep 5, 2018, at 4:52 AM, ams@codesourcery.com wrote:
> There are a number of tests that fail because they assume that exceptions are
> available, but GCN does not support them, yet.

So, generally we don't goop up the testsuite with the day-to-day port stuff
while a port is being developed.  If the port is finished, and EH can't be
done, this type of change is fine.  If someone plans on doing it in the next
5 years and the port is still being developed, there is likely little reason
to do this.  People who track regressions do so by differencing, and that
easily handles massive amounts of failures seamlessly.

So, my question would be: has it just not been worked on yet, or is it basically impossible to ever do it?

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 12/25] Make default_static_chain return NULL in non-static functions
  2018-09-05 11:50 ` [PATCH 12/25] Make default_static_chain return NULL in non-static functions ams
@ 2018-09-17 18:55   ` Richard Sandiford
  2018-09-28 14:23     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17 18:55 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> This patch allows default_static_chain to be called from the back-end without
> it knowing if the function is static or not.  Or, to put it another way,
> without duplicating the check everywhere it's used.
>
> 2018-09-05  Tom de Vries  <tom@codesourcery.com>
>
> 	gcc/
> 	* targhooks.c (default_static_chain): Return NULL in non-static
> 	functions.
> ---
>  gcc/targhooks.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/targhooks.c b/gcc/targhooks.c
> index afd56f3..742cfbf 100644
> --- a/gcc/targhooks.c
> +++ b/gcc/targhooks.c
> @@ -1021,8 +1021,14 @@ default_internal_arg_pointer (void)
>  }
>  
>  rtx
> -default_static_chain (const_tree ARG_UNUSED (fndecl_or_type), bool incoming_p)
> +default_static_chain (const_tree fndecl_or_type, bool incoming_p)
>  {
> +  /* While this function won't be called by the middle-end when a static
> +     chain isn't needed, it's also used throughout the backend so it's
> +     easiest to keep this check centralized.  */
> +  if (DECL_P (fndecl_or_type) && !DECL_STATIC_CHAIN (fndecl_or_type))
> +    return NULL;
> +
>    if (incoming_p)
>      {
>  #ifdef STATIC_CHAIN_INCOMING_REGNUM

Not sure about this.  The caller has to make sure the query's sensible
for types, since types don't indicate whether they need a static chain.
Allowing it to be more sloppy for decls seems a bit dangerous.

Which part of the backend needs this?  I couldn't tell from a quick
grep where the call came from.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-05 11:51 ` [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE ams
@ 2018-09-17 19:31   ` Richard Sandiford
  2018-09-18  9:02     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-17 19:31 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

<ams@codesourcery.com> writes:
> This feature probably ought to be reworked as a proper target hook, but I would
> like to know if this is the correct solution to the problem first.
>
> The problem is that GCN vectors have a fixed number of elements (64) and the
> vector size varies with element size.  E.g. V64QI is 64 bytes and V64SI is 256
> bytes.
>
> This is a problem because GCC has an assumption that a) vector registers are
> fixed size, and b) if there are multiple vector sizes you want to pick one size
> and stick with it for the whole function.

The whole of the current vectorisation region rather than the whole function,
but yeah, this is a fundamental assumption with the current autovec code.
It's something that would be really good to fix...

> This is a problem in various places, but mostly it's not fatal. However,
> get_vectype_for_scalar_type caches the vector size for the first type it
> encounters and then tries to apply that to all subsequent vectors, which
> completely destroys vectorization.  The caching feature appears to be an
> attempt to cope with AVX having a different vector size to other x86 vector
> options.
>
> This patch simply disables the cache so that it must ask the backend for the
> preferred mode for every type.

TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
have posted it, but it seems like an accident.  Various parts of the
vectoriser query current_vector_size and expect it to be stable for
the current choice of vector size.

The underlying problem also affects (at least) base AArch64, SVE and x86_64.
We try to choose vector types on the fly based only on the type of a given
scalar value, but in reality, the type we want for a 32-bit element (say)
often depends on whether the vectorisation region also has smaller or
larger elements.  And in general we only know that after
vect_mark_stmts_to_be_vectorized, but we want to know the vector types
earlier, such as in pattern recognition and while building SLP trees.
It's a bit of a chicken-and-egg problem...

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 14:14         ` Andrew Stubbs
  2018-09-13 14:39           ` Paul Koning
@ 2018-09-17 22:59           ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-09-17 22:59 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/13/18 8:08 AM, Andrew Stubbs wrote:
> On 13/09/18 11:01, Andrew Stubbs wrote:
>> The assert is caused because the def-use chains indicate that SCC
>> conflicts with itself. I suppose the question is why is it doing that,
>> but it's probably to do with that being a special register that gets
>> used in split2 (particularly by the addptrdi3 pattern). Although,
>> those patterns are careful to save SCC to one side and then restore it
>> again after, so I'd have thought the DF analysis would work out?
> 
> I think I may have a theory on this one now....
> 
> The addptrdi3 pattern must use two 32-bit adds with a carry in SCC, but
> addptr patterns are not allowed to clobber SCC, so the splitter
> carefully saves and restores the old value.
> 
> This is correct at runtime, and looks correct in RTL dumps, but it means
> that there's still a single rtx REG instance holding the live SCC
> register, and it's still live before and after the new add instruction.
> 
> Would I be right in thinking that the dataflow analysis doesn't like this?
> 
> I think I have a work-around (by using different instructions), but is
> there a correct way to do this if there weren't an alternative?
I would expect dataflow to treat the SCC save as a use of the SCC
register.  That's likely to cause it to be live on all paths from the
entry to the SCC save.
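
As a concrete example: a splitter that emits a copy out of SCC, the two
32-bit adds, and then a copy back into SCC gives DF a use of SCC with no
earlier definition in the function, so SCC is computed as live from the
entry block all the way down to the save, even though its value there is
never meaningful at runtime.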

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-17 19:31   ` Richard Sandiford
@ 2018-09-18  9:02     ` Andrew Stubbs
  2018-09-18 11:30       ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-18  9:02 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 17/09/18 20:28, Richard Sandiford wrote:
>> This patch simply disables the cache so that it must ask the backend for the
>> preferred mode for every type.
> 
> TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
> have posted it, but it seems like an accident.  Various parts of the
> vectoriser query current_vector_size and expect it to be stable for
> the current choice of vector size.

Indeed, this is why this remains only a half-baked patch: I wasn't 
confident it was the correct or whole solution.

It works inasmuch as it fixes the immediate problem that I saw -- "no 
vector type" -- and makes a bunch of vect.exp testcases happy.

It's quite possible that something else is unhappy with this.

> The underlying problem also affects (at least) base AArch64, SVE and x86_64.
> We try to choose vector types on the fly based only on the type of a given
> scalar value, but in reality, the type we want for a 32-bit element (say)
> often depends on whether the vectorisation region also has smaller or
> larger elements.  And in general we only know that after
> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
> earlier, such as in pattern recognition and while building SLP trees.
> It's a bit of a chicken-and-egg problem...

I don't understand why the number of bits in a vector is the key 
information here?

It would make sense if you were to say that the number of elements has 
to be fixed in a given region, because obviously that's tied to loop 
strides and such, but why the size?

It seems like there is an architecture where you don't want to mix 
instruction types (SSE vs. AVX?) and that makes sense for that 
architecture, but if that's the case then we need to be able to turn it 
off for other architectures.

For GCN, vectors are fully maskable, so we almost want such 
considerations to be completely ignored.  We basically want it to act 
like it can have any size vector it likes, up to 64 elements.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-18  9:02     ` Andrew Stubbs
@ 2018-09-18 11:30       ` Richard Sandiford
  2018-09-18 20:27         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-18 11:30 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 20:28, Richard Sandiford wrote:
>>> This patch simply disables the cache so that it must ask the backend for the
>>> preferred mode for every type.
>> 
>> TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
>> have posted it, but it seems like an accident.  Various parts of the
>> vectoriser query current_vector_size and expect it to be stable for
>> the current choice of vector size.
>
> Indeed, this is why this remains only a half-baked patch: I wasn't 
> confident it was the correct or whole solution.
>
> It works in so much as it fixes the immediate problem that I saw -- "no 
> vector type" -- and makes a bunch of vect.exp testcases happy.
>
> It's quite possible that something else is unhappy with this.
>
>> The underlying problem also affects (at least) base AArch64, SVE and x86_64.
>> We try to choose vector types on the fly based only on the type of a given
>> scalar value, but in reality, the type we want for a 32-bit element (say)
>> often depends on whether the vectorisation region also has smaller or
>> larger elements.  And in general we only know that after
>> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
>> earlier, such as in pattern recognition and while building SLP trees.
>> It's a bit of a chicken-and-egg problem...
>
> I don't understand why the number of bits in a vector is the key 
> information here?

Arguably it shouldn't be, and it's really just a proxy for the vector
(sub)architecture.  But this is "should be" vs. "is" :-)

> It would make sense if you were to say that the number of elements has 
> to be fixed in a given region, because obviously that's tied to loop 
> strides and such, but why the size?
>
> It seems like there is an architecture where you don't want to mix
> instruction types (SSE vs. AVX?) and that makes sense for that 
> architecture, but if that's the case then we need to be able to turn it 
> off for other architectures.

It's not about trying to avoid mixing vector sizes: from what Jakub
said earlier in the year, even x86 wants to do that (but can't yet).
The idea is instead to try the available possibilities.

E.g. for AArch64 we want to try SVE, 128-bit Advanced SIMD and
64-bit Advanced SIMD.  With something like:

  int *ip;
  short *sp;
  for (int i = 0; i < n; ++i)
    ip[i] = sp[i];

there are three valid choices for Advanced SIMD:

(1) use 1 128-bit vector of sp and 2 128-bit vectors of ip
(2) use 1 64-bit vector of sp and 2 64-bit vectors of ip
(3) use 1 64-bit vector of sp and 1 128-bit vector of ip

At the moment we only try (1) and (2), but in practice, (3) should be
better than (2) in most cases.  I guess in some ways trying all three
would be best, but if we only try two, trying (1) and (3) is better
than trying (1) and (2).
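
(Concretely, for the loop above, (3) would pair V4HI for sp -- 4 x 16
bits = 64 bits -- with V4SI for ip -- 4 x 32 bits = 128 bits -- so both
accesses share a VF of 4 with no packing or unpacking between them.)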

For:

  for (int i = 0; i < n; ++i)
    ip[i] += 1;

there are two valid choices for Advanced SIMD:

(4) use 1 128-bit vector of ip
(5) use 1 64-bit vector of ip

The problem for the current autovec set-up is that the ip type for
64-bit Advanced SIMD varies between (3) and (5): for (3) it's a
128-bit vector type and for (5) it's a 64-bit vector type.
So the type we want for a given vector subarchitecture is partly
determined by the other types in the region: it isn't simply a
function of the subarchitecture and the element type.

This is why the current autovec code only supports (1), (2),
(4) and (5).  And I think this is essentially the same limitation
that you're hitting.

> For GCN, vectors are fully maskable, so we almost want such 
> considerations to be completely ignored.  We basically want it to act 
> like it can have any size vector it likes, up to 64 elements.

SVE is similar.  But even for SVE there's an equivalent trade-off
between (1) and (3):

(1') use 1 fully-populated vector for sp and 2 fully-populated
     vectors for ip
(3') use 1 half-populated vector for sp and 1 fully-populated
     vector for ip

Which is best for more complicated examples depends on the balance
between ip-based work and sp-based work.  The packing and unpacking
in (1') has a cost, but it would pay off if there was much more
sp work than ip work, since in that case (3') would spend most
of its time operating on partially-populated vectors.

Would the same be useful for GCN, or do you basically always
want a VF of 64?

None of this is a fundamental restriction in theory.  It's just
something that needs to be fixed.

One approach would be to get the loop vectoriser to iterate over the
number of lanes the target supports instead of all possible vector
sizes.  The problem is that on its own this would mean trying 4
lane counts even on targets with a single supported vector size.
So we'd need to do something a bit smarter...

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-18 11:30       ` Richard Sandiford
@ 2018-09-18 20:27         ` Andrew Stubbs
  2018-09-19 13:46           ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-18 20:27 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 18/09/18 12:21, Richard Sandiford wrote:
> Would the same be useful for GCN, or do you basically always
> want a VF of 64?

Always 64; the vector size varies between 512-bit and 4096-bit, as needed.
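
(In other words, the lane count is fixed and the element mode sets the
size: V64QI is 64 x 8 bits = 512 bits, while V64DI is 64 x 64 bits =
4096 bits.)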

> None of this is a fundamental restriction in theory.  It's just
> something that needs to be fixed.
> 
> One approach would be to get the loop vectoriser to iterate over the
> number of lanes the target supports instead of all possible vector
> sizes.  The problem is that on its own this would mean trying 4
> lane counts even on targets with a single supported vector size.
> So we'd need to do something a bit smarter...

Yeah, that sounds like an interesting project, but way more than I think 
I need. Basically, we don't need to iterate over anything; there's only 
one option for each mode.

For the purposes of this patch, might it be enough to track down all the 
places that use the current_vector_size and fix them up, somehow?

Obviously, I'm not sure what that means just yet ...

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 16/25] Fix IRA ICE.
  2018-09-17  9:36   ` Richard Sandiford
@ 2018-09-18 22:00     ` Andrew Stubbs
  2018-09-20 12:47       ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-18 22:00 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 17/09/18 10:22, Richard Sandiford wrote:
> <ams@codesourcery.com> writes:
>> The IRA pass makes an assumption that any pseudos created after the pass begins
>> were created explicitly by the pass itself and therefore will have
>> corresponding entries in its other tables.
>>
>> The GCN back-end, however, often creates additional pseudos, in expand
>> patterns, to represent the necessary EXEC value, and these break IRA's
>> assumption and cause ICEs.
>>
>> This patch simply has IRA skip unknown pseudos, and the problem goes away.
>>
>> Presumably, it's not ideal that these registers have not been processed by IRA,
>> but it does not appear to do any real harm.
> 
> Could you go into more detail about how this happens?  Other targets
> also create pseudos in their move patterns.

Here's a simplified snippet from the machine description:

(define_expand "mov<mode>"
  [(set (match_operand:VEC_REG_MODE 0 "nonimmediate_operand")
        (match_operand:VEC_REG_MODE 1 "general_operand"))]
  ""
  {
    [...]

    if (can_create_pseudo_p ())
      {
        rtx exec = gcn_full_exec_reg ();
        rtx undef = gcn_gen_undef (<MODE>mode);

        [...]

        emit_insn (gen_mov<mode>_vector (operands[0], operands[1], exec,
                                         undef));
        [...]

        DONE;
      }
  })

gcn_full_exec_reg creates a new pseudo. It gets used as the mask 
parameter of a vec_merge.
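
A minimal sketch of what such a helper presumably does, assuming EXEC is
a DImode mask with one bit per lane (illustrative only -- the real code
lives in the GCN port):

static rtx
gcn_full_exec_reg (void)
{
  /* A fresh pseudo on every call -- these are the registers that IRA
     is never told about.  */
  rtx exec = gen_reg_rtx (DImode);
  emit_move_insn (exec, constm1_rtx);  /* all 64 lanes enabled */
  return exec;
}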

These registers then trip the asserts in ira.c.

In the case of setup_preferred_alternate_classes_for_new_pseudos it's 
because they have numbers greater than "start" but have not been 
initialized with different ORIGINAL_REGNO (why would they have been?)
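
The check being tripped is roughly this (paraphrasing ira.c, not quoting
it exactly):

for (i = start; i < max_regno; i++)
  {
    old_regno = ORIGINAL_REGNO (regno_reg_rtx[i]);
    /* Fires for pseudos the backend made behind IRA's back.  */
    ira_assert (i != old_regno);
    ...
  }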

In the case of move_unallocated_pseudos it's because the table 
pseudo_replaced_reg only has entries for the new pseudos directly 
created by find_moveable_pseudos, not the ones created indirectly.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-18 20:27         ` Andrew Stubbs
@ 2018-09-19 13:46           ` Richard Biener
  2018-09-28 12:48             ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-19 13:46 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches, Richard Sandiford

On Tue, Sep 18, 2018 at 10:22 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 18/09/18 12:21, Richard Sandiford wrote:
> > Would the same be useful for GCN, or do you basically always
> > want a VF of 64?
>
> Always 64; the vector size varies between 512-bit and 4096-bit, as needed.
>
> > None of this is a fundamental restriction in theory.  It's just
> > something that needs to be fixed.
> >
> > One approach would be to get the loop vectoriser to iterate over the
> > number of lanes the target supports insteaad of all possible vector
> > sizes.  The problem is that on its own this would mean trying 4
> > lane counts even on targets with a single supported vector size.
> > So we'd need to do something a bit smarter...
>
> Yeah, that sounds like an interesting project, but way more than I think
> I need. Basically, we don't need to iterate over anything; there's only
> one option for each mode.
>
> For the purposes of this patch, might it be enough to track down all the
> places that use the current_vector_size and fix them up, somehow?
>
> Obviously, I'm not sure what that means just yet ...

I think the only part that wants a "fixed" size is the code iterating over
vector sizes.  All the rest of the code simply wants to commit to
a specific vector type for each DEF - to match the ISAs we've faced
so far, the approach is simply to choose the vector type of size
current_vector_size with the proper element type.

I've long wanted to fix that part in a way to actually commit to vector types
later and compute the DEF vector type of a stmt by looking at the vector
type of the USEs and the operation.

So I guess the current_vector_size thing isn't too hard to get rid of; what
you'd end up with would be using that size when you decide on vector
types for loads (where there are no USEs with vector types, so for example
this would not apply to gathers).

So I'd say you want to refactor get_same_sized_vectype uses and
make the size argument to get_vectype_for_scalar_type_and_size
a hint only.

Richard.

>
> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-15  6:01               ` Julian Brown
@ 2018-09-19 15:23                 ` Julian Brown
  2018-09-20 12:36                   ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Julian Brown @ 2018-09-19 15:23 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Richard Biener, Jeff Law, Jan Hubicka, GCC Patches

On Fri, 14 Sep 2018 22:49:35 -0400
Julian Brown <julian@codesourcery.com> wrote:

> > > On 12/09/18 16:16, Richard Biener wrote:    
> > > It may well be that there's a better way to solve the problem, or
> > > at least to do the lookups.
> > > 
> > > It may also be that there are some unintended consequences, such
> > > as false name matches, but I don't know of any at present.

> > Possibly, this was an abuse of these hooks, but it's arguably wrong
> > that that e.g. handle_alias_pairs has the "assembler name" leak
> > through into the user's source code -- if it's expected that the
> > hook could make arbitrary transformations to the string. (The
> > latter hook is only used by PE code for x86 at present, by the look
> > of it, and the default handles only special-purpose mangling
> > indicated by placing a '*' at the front of the symbol.)  

Two places I've found that currently expose the underlying symbol name
in the user's source code: one (documented!) is C++, where one must
write the mangled symbol name as the alias target:

int foo (int c) { ... }
int bar (int) __attribute__((alias("_Z3fooi")));

another (perhaps obscure) is x86/PE with "fastcall":

__attribute__((fastcall)) void foo(void) { ... }
void bar(void) __attribute__((alias("@foo@0")));

both of which probably suggest that using the decl name, rather than
demangling the assembler name (or using some completely different
solution) was the wrong thing to do.

I'll keep thinking about this...

Julian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-05 18:07     ` Janne Blomqvist
@ 2018-09-19 16:38       ` Andrew Stubbs
  2018-09-19 22:27         ` Damian Rouson
  2018-09-20 15:59         ` [PATCH 08/25] Fix co-array allocation Janne Blomqvist
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-19 16:38 UTC (permalink / raw)
  To: Janne Blomqvist, Toon Moene, GCC Patches, Fortran List

[-- Attachment #1: Type: text/plain, Size: 789 bytes --]

On 05/09/18 19:07, Janne Blomqvist wrote:
> The argument must be of type size_type_node, not sizetype. Please instead
> use
> 
> size = build_zero_cst (size_type_node);
> 
> 
>>          * trans-intrinsic.c (conv_intrinsic_event_query): Convert computed
>>          index to a size_t type.
>>
> 
> Using integer_type_node is wrong, but the correct type for calculating
> array indices (lbound, ubound,  etc.) is not size_type_node but rather
> gfc_array_index_type (which in practice maps to ptrdiff_t). So please use
> that, and then fold_convert index to size_type_node just before generating
> the call to event_query.
> 
> 
>>          * trans-stmt.c (gfc_trans_event_post_wait): Likewise.
>>
> 
> Same here as above.

How is the attached? I retested and found no regressions.

Andrew

[-- Attachment #2: 180919-fix-co-array-allocation.patch --]
[-- Type: text/x-patch, Size: 3341 bytes --]

Fix co-array allocation

The Fortran front-end has a bug in which it uses "int" values for "size_t"
parameters.  I don't know why this isn't a problem for all 64-bit architectures,
but GCN ends up with the data in the wrong argument register and/or stack slot,
and bad things happen.

This patch corrects the issue by setting the correct type.

2018-09-19  Andrew Stubbs  <ams@codesourcery.com>
            Kwok Cheung Yeung  <kcy@codesourcery.com>

	gcc/fortran/
	* trans-expr.c (gfc_trans_structure_assign): Ensure that the first
	argument of a call to _gfortran_caf_register is of size_type_node.
	* trans-intrinsic.c (conv_intrinsic_event_query): Convert computed
	index to a size_type_node type.
	* trans-stmt.c (gfc_trans_event_post_wait): Likewise.

diff --git a/gcc/fortran/trans-expr.c b/gcc/fortran/trans-expr.c
index 56ce98c..28079ac 100644
--- a/gcc/fortran/trans-expr.c
+++ b/gcc/fortran/trans-expr.c
@@ -7729,7 +7729,7 @@ gfc_trans_structure_assign (tree dest, gfc_expr * expr, bool init, bool coarray)
 		 suffices to recognize the data as array.  */
 	      if (rank < 0)
 		rank = 1;
-	      size = integer_zero_node;
+	      size = build_zero_cst (size_type_node);
 	      desc = field;
 	      gfc_add_modify (&block, gfc_conv_descriptor_rank (desc),
 			      build_int_cst (signed_char_type_node, rank));
diff --git a/gcc/fortran/trans-intrinsic.c b/gcc/fortran/trans-intrinsic.c
index b2cea93..569435d 100644
--- a/gcc/fortran/trans-intrinsic.c
+++ b/gcc/fortran/trans-intrinsic.c
@@ -10732,7 +10732,9 @@ conv_intrinsic_event_query (gfc_code *code)
 	      tmp = fold_build2_loc (input_location, MULT_EXPR,
 				     integer_type_node, extent, tmp);
 	      index = fold_build2_loc (input_location, PLUS_EXPR,
-				       integer_type_node, index, tmp);
+				       gfc_array_index_type, index,
+				       fold_convert (gfc_array_index_type,
+						     tmp));
 	      if (i < ar->dimen - 1)
 		{
 		  ubound = gfc_conv_descriptor_ubound_get (desc, gfc_rank_cst[i]);
@@ -10756,6 +10758,7 @@ conv_intrinsic_event_query (gfc_code *code)
 	  stat = gfc_create_var (integer_type_node, "stat");
 	}
 
+      index = fold_convert (size_type_node, index);
       tmp = build_call_expr_loc (input_location, gfor_fndecl_caf_event_query, 5,
                                    token, index, image_index, count
 				   ? gfc_build_addr_expr (NULL, count) : count,
diff --git a/gcc/fortran/trans-stmt.c b/gcc/fortran/trans-stmt.c
index 795d3cc..92d9c37 100644
--- a/gcc/fortran/trans-stmt.c
+++ b/gcc/fortran/trans-stmt.c
@@ -1096,7 +1096,8 @@ gfc_trans_event_post_wait (gfc_code *code, gfc_exec_op op)
 	  tmp = fold_build2_loc (input_location, MULT_EXPR,
 				 integer_type_node, extent, tmp);
 	  index = fold_build2_loc (input_location, PLUS_EXPR,
-				   integer_type_node, index, tmp);
+				   gfc_array_index_type, index,
+				   fold_convert (gfc_array_index_type, tmp));
 	  if (i < ar->dimen - 1)
 	    {
 	      ubound = gfc_conv_descriptor_ubound_get (desc, gfc_rank_cst[i]);
@@ -1130,6 +1131,7 @@ gfc_trans_event_post_wait (gfc_code *code, gfc_exec_op op)
       stat = gfc_create_var (integer_type_node, "stat");
     }
 
+  index = fold_convert (size_type_node, index);
   if (op == EXEC_EVENT_POST)
     tmp = build_call_expr_loc (input_location, gfor_fndecl_caf_event_post, 6,
 			       token, index, image_index,

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-11 22:46   ` Jeff Law
  2018-09-12  8:47     ` Andrew Stubbs
@ 2018-09-19 17:25     ` Andrew Stubbs
  2018-09-20 11:42       ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-19 17:25 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

On 11/09/18 23:45, Jeff Law wrote:
> On 9/5/18 5:49 AM, ams@codesourcery.com wrote:
>>
>> GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
>> far more bearable by eliding long sequences of the same element into "repeated"
>> messages.
>>
>> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>> 	    Jan Hubicka  <jh@suse.cz>
>> 	    Martin Jambor  <mjambor@suse.cz>
>>
>> 	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
>> 	the same elements are repeated rather than printing all of them.
> Does this need a corresponding change to the RTL front-end so that it
> can read the new form?

Here's an updated patch incorporating the RTL front-end changes. I had 
to change from "repeated 2x" to "repeated x2" because the former is not 
a valid C token, and apparently that's important.
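
A minimal illustration of the tokenization issue, in plain C with
hypothetical variable names:

int x2 = 2;      /* fine: "x2" is an ordinary identifier */
/* int 2x = 2;      error: "2x" is a single, invalid pp-number token */

So a reader that tokenizes with C rules can pick up "x2" as a name, but
can never see "2x" as one.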

I've confirmed that it can read RTL and that subsequent dumps look correct.

OK?

Andrew

[-- Attachment #2: 180919-elide-repeated-RTL-elements.patch --]
[-- Type: text/x-patch, Size: 2650 bytes --]

Elide repeated RTL elements.

GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
far more bearable by eliding long sequences of the same element into "repeated"
messages.

This also takes care of reading repeated sequences in the RTL front-end.

2018-09-19  Andrew Stubbs  <ams@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	gcc.
	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
	the same elements are repeated rather than printing all of them.
	* read-rtl.c (rtx_reader::read_rtx_operand): Recognize and expand
	"repeated" elements.

diff --git a/gcc/print-rtl.c b/gcc/print-rtl.c
index 5dd2e31..1228483 100644
--- a/gcc/print-rtl.c
+++ b/gcc/print-rtl.c
@@ -370,7 +370,20 @@ rtx_writer::print_rtx_operand_codes_E_and_V (const_rtx in_rtx, int idx)
 	m_sawclose = 1;
 
       for (int j = 0; j < XVECLEN (in_rtx, idx); j++)
-	print_rtx (XVECEXP (in_rtx, idx, j));
+	{
+	  int j1;
+
+	  print_rtx (XVECEXP (in_rtx, idx, j));
+	  for (j1 = j + 1; j1 < XVECLEN (in_rtx, idx); j1++)
+	    if (XVECEXP (in_rtx, idx, j) != XVECEXP (in_rtx, idx, j1))
+	      break;
+
+	  if (j1 != j + 1)
+	    {
+	      fprintf (m_outfile, " repeated x%i", j1 - j);
+	      j = j1 - 1;
+	    }
+	}
 
       m_indent -= 2;
     }
diff --git a/gcc/read-rtl.c b/gcc/read-rtl.c
index 723c3e1..7ede18f 100644
--- a/gcc/read-rtl.c
+++ b/gcc/read-rtl.c
@@ -1690,6 +1690,8 @@ rtx_reader::read_rtx_operand (rtx return_rtx, int idx)
 	struct obstack vector_stack;
 	int list_counter = 0;
 	rtvec return_vec = NULL_RTVEC;
+	rtx saved_rtx = NULL_RTX;
+	int repeat_count = 0;
 
 	require_char_ws ('[');
 
@@ -1700,8 +1702,34 @@ rtx_reader::read_rtx_operand (rtx return_rtx, int idx)
 	    if (c == EOF)
 	      fatal_expected_char (']', c);
 	    unread_char (c);
+
+	    rtx value;
+	    if (repeat_count <= 0 && c == 'r')
+	      {
+		/* Process "repeated Nx" directive.  */
+		read_name (&name);
+		if (strcmp (name.string, "repeated"))
+		  fatal_with_file_and_line ("invalid directive \"%s\"\n",
+					    name.string);
+		read_name (&name);
+		if (!sscanf (name.string, "x%d", &repeat_count))
+		  fatal_with_file_and_line ("invalid repeat count \"%s\"\n",
+					    name.string);
+
+		/* We already saw one of the instances.  */
+		repeat_count--;
+	      }
+	    if (repeat_count > 0)
+	      {
+		repeat_count--;
+		value = saved_rtx;
+	      }
+	    else
+	      value = read_nested_rtx ();
+
 	    list_counter++;
-	    obstack_ptr_grow (&vector_stack, read_nested_rtx ());
+	    obstack_ptr_grow (&vector_stack, value);
+	    saved_rtx = value;
 	  }
 	if (list_counter > 0)
 	  {

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-19 16:38       ` Andrew Stubbs
@ 2018-09-19 22:27         ` Damian Rouson
  2018-09-19 22:55           ` Andrew Stubbs
  2018-09-20 20:49           ` Thomas Koenig
  2018-09-20 15:59         ` [PATCH 08/25] Fix co-array allocation Janne Blomqvist
  1 sibling, 2 replies; 187+ messages in thread
From: Damian Rouson @ 2018-09-19 22:27 UTC (permalink / raw)
  To: ams; +Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

Has this been tested in multi-image execution using OpenCoarrays?   If not,
I would be glad to assist with installing OpenCoarrays so that it can be
part of the testing process.

On a related note, two Sourcery Institute developers have attempted to edit
the GCC build system to make the downloading and building of OpenCoarrays
an automatic part of the gfortran build process.  Neither developer
succeeded.  If anyone has any interest in figuring out how to do this, it
will prevent a lot of potential regressions when single-image testing
doesn't expose issues that only arise with multi-image execution.

Damian

On Wed, Sep 19, 2018 at 9:25 AM Andrew Stubbs <ams@codesourcery.com> wrote:

> On 05/09/18 19:07, Janne Blomqvist wrote:
> > The argument must be of type size_type_node, not sizetype. Please instead
> > use
> >
> > size = build_zero_cst (size_type_node);
> >
> >
> >>          * trans-intrinsic.c (conv_intrinsic_event_query): Convert
> computed
> >>          index to a size_t type.
> >>
> >
> > Using integer_type_node is wrong, but the correct type for calculating
> > array indices (lbound, ubound,  etc.) is not size_type_node but rather
> > gfc_array_index_type (which in practice maps to ptrdiff_t). So please use
> > that, and then fold_convert index to size_type_node just before
> generating
> > the call to event_query.
> >
> >
> >>          * trans-stmt.c (gfc_trans_event_post_wait): Likewise.
> >>
> >
> > Same here as above.
>
> How is the attached? I retested and found no regressions.
>
> Andrew
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-19 22:27         ` Damian Rouson
@ 2018-09-19 22:55           ` Andrew Stubbs
  2018-09-20  1:21             ` Damian Rouson
  2018-09-20 20:49           ` Thomas Koenig
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-19 22:55 UTC (permalink / raw)
  To: Damian Rouson; +Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

On 19/09/18 22:18, Damian Rouson wrote:
> Has this been tested in multi-image execution using OpenCoarrays?   If 
> not, I would be glad to assist with installing OpenCoarrays so that it 
> can be part of the testing process.

It's been tested with the GCC testsuite -- the same suite that found the 
issue in the first place.

If you want to port your tool to GCN, that would be cool, but I suspect 
it would be non-trivial.

> On a related note, two Sourcery Institute developers have attempted to 
> edit the GCC build system to make the downloading and building of 
> OpenCoarrays automatically part of the gfortran build process.  Neither 
> developer succeeded.  If anyone has any interest in figuring out how to 
> do this, it will prevent a lot of potential regressions when 
> single-image testing doesn't expose issues that only arise with 
> multi-image execution.

I suggest you post this question in a fresh thread.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-19 22:55           ` Andrew Stubbs
@ 2018-09-20  1:21             ` Damian Rouson
  0 siblings, 0 replies; 187+ messages in thread
From: Damian Rouson @ 2018-09-20  1:21 UTC (permalink / raw)
  To: ams; +Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

On Wed, Sep 19, 2018 at 3:30 PM Andrew Stubbs <ams@codesourcery.com> wrote:

>
> If you want to port your tool to GCN that would be cool, but I suspect
> non-trivial.
>

To clarify, OpenCoarrays is not a tool.  It is the parallel ABI required to
create executable programs capable of executing in multiple images as
required by the Fortran 2008 standard.  Multi-image execution is the reason
coarray features exist so I hope the maintainers won't approve a patch that
impacts coarray features but has not been tested against OpenCoarrays,
which has its own test suite.  Again, I would be glad to assist with
installing OpenCoarrays on your system. Whether it's trivial or not, it's
essential to protect against breaking a large feature set that is part of
Fortran 2008 and 2018.

Damian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-19 17:25     ` Andrew Stubbs
@ 2018-09-20 11:42       ` Andrew Stubbs
  2018-09-26 16:23         ` Andrew Stubbs
  2018-10-04 18:24         ` Jeff Law
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-20 11:42 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 389 bytes --]

On 19/09/18 17:38, Andrew Stubbs wrote:
> Here's an updated patch incorporating the RTL front-end changes. I had 
> to change from "repeated 2x" to "repeated x2" because the former is not 
> a valid C token, and apparently that's important.

Here's a patch with self tests added, for both reading and writing.

It also fixes a bug when the repeat was the last item in a list.

OK?

Andrew

[-- Attachment #2: 180920-elide-repeated-RTL-elements.patch --]
[-- Type: text/x-patch, Size: 5451 bytes --]

Elide repeated RTL elements.

GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
far more bearable by eliding long sequences of the same element into "repeated"
messages.

This also takes care of reading repeated sequences in the RTL front-end.

There are self tests for both reading and writing.

2018-09-20  Andrew Stubbs  <ams@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	gcc/
	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
	the same elements are repeated rather than printing all of them.
	* read-rtl.c (rtx_reader::read_rtx_operand): Recognize and expand
	"repeated" elements.
	* read-rtl-function.c (test_loading_repeat): New function.
	(read_rtl_function_c_tests): Call test_loading_repeat.
	* rtl-tests.c (test_dumping_repeat): New function.
	(rtl_tests_c_tests): Call test_dumping_repeat.

	gcc/testsuite/
	* selftests/repeat.rtl: New file.

diff --git a/gcc/print-rtl.c b/gcc/print-rtl.c
index 5dd2e31..1228483 100644
--- a/gcc/print-rtl.c
+++ b/gcc/print-rtl.c
@@ -370,7 +370,20 @@ rtx_writer::print_rtx_operand_codes_E_and_V (const_rtx in_rtx, int idx)
 	m_sawclose = 1;
 
       for (int j = 0; j < XVECLEN (in_rtx, idx); j++)
-	print_rtx (XVECEXP (in_rtx, idx, j));
+	{
+	  int j1;
+
+	  print_rtx (XVECEXP (in_rtx, idx, j));
+	  for (j1 = j + 1; j1 < XVECLEN (in_rtx, idx); j1++)
+	    if (XVECEXP (in_rtx, idx, j) != XVECEXP (in_rtx, idx, j1))
+	      break;
+
+	  if (j1 != j + 1)
+	    {
+	      fprintf (m_outfile, " repeated x%i", j1 - j);
+	      j = j1 - 1;
+	    }
+	}
 
       m_indent -= 2;
     }
diff --git a/gcc/read-rtl-function.c b/gcc/read-rtl-function.c
index cde9d3e..8746f70 100644
--- a/gcc/read-rtl-function.c
+++ b/gcc/read-rtl-function.c
@@ -2166,6 +2166,20 @@ test_loading_mem ()
   ASSERT_EQ (6, MEM_ADDR_SPACE (mem2));
 }
 
+/* Verify that "repeated xN" is read correctly.  */
+
+static void
+test_loading_repeat ()
+{
+  rtl_dump_test t (SELFTEST_LOCATION, locate_file ("repeat.rtl"));
+
+  rtx_insn *insn_1 = get_insn_by_uid (1);
+  ASSERT_EQ (PARALLEL, GET_CODE (PATTERN (insn_1)));
+  ASSERT_EQ (64, XVECLEN (PATTERN (insn_1), 0));
+  for (int i = 0; i < 64; i++)
+    ASSERT_EQ (const0_rtx, XVECEXP (PATTERN (insn_1), 0, i));
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -2187,6 +2201,7 @@ read_rtl_function_c_tests ()
   test_loading_cfg ();
   test_loading_bb_index ();
   test_loading_mem ();
+  test_loading_repeat ();
 }
 
 } // namespace selftest
diff --git a/gcc/read-rtl.c b/gcc/read-rtl.c
index 723c3e1..d698dd4 100644
--- a/gcc/read-rtl.c
+++ b/gcc/read-rtl.c
@@ -1690,6 +1690,7 @@ rtx_reader::read_rtx_operand (rtx return_rtx, int idx)
 	struct obstack vector_stack;
 	int list_counter = 0;
 	rtvec return_vec = NULL_RTVEC;
+	rtx saved_rtx = NULL_RTX;
 
 	require_char_ws ('[');
 
@@ -1700,8 +1701,34 @@ rtx_reader::read_rtx_operand (rtx return_rtx, int idx)
 	    if (c == EOF)
 	      fatal_expected_char (']', c);
 	    unread_char (c);
-	    list_counter++;
-	    obstack_ptr_grow (&vector_stack, read_nested_rtx ());
+
+	    rtx value;
+	    int repeat_count = 1;
+	    if (c == 'r')
+	      {
+		/* Process "repeated xN" directive.  */
+		read_name (&name);
+		if (strcmp (name.string, "repeated"))
+		  fatal_with_file_and_line ("invalid directive \"%s\"\n",
+					    name.string);
+		read_name (&name);
+		if (!sscanf (name.string, "x%d", &repeat_count))
+		  fatal_with_file_and_line ("invalid repeat count \"%s\"\n",
+					    name.string);
+
+		/* We already saw one of the instances.  */
+		repeat_count--;
+		value = saved_rtx;
+	      }
+	    else
+	      value = read_nested_rtx ();
+
+	    for (; repeat_count > 0; repeat_count--)
+	      {
+		list_counter++;
+		obstack_ptr_grow (&vector_stack, value);
+	      }
+	    saved_rtx = value;
 	  }
 	if (list_counter > 0)
 	  {
diff --git a/gcc/rtl-tests.c b/gcc/rtl-tests.c
index f67f2a3..c684f8e 100644
--- a/gcc/rtl-tests.c
+++ b/gcc/rtl-tests.c
@@ -284,6 +284,29 @@ const_poly_int_tests<N>::run ()
 	     gen_int_mode (poly_int64 (5, -1), QImode));
 }
 
+/* Check dumping of repeated RTL vectors.  */
+
+static void
+test_dumping_repeat ()
+{
+  rtx p = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (3));
+  XVECEXP (p, 0, 0) = const0_rtx;
+  XVECEXP (p, 0, 1) = const0_rtx;
+  XVECEXP (p, 0, 2) = const0_rtx;
+  ASSERT_RTL_DUMP_EQ ("(parallel [\n"
+		      "        (const_int 0) repeated x3\n"
+		      "    ])",
+		      p);
+
+  XVECEXP (p, 0, 1) = const1_rtx;
+  ASSERT_RTL_DUMP_EQ ("(parallel [\n"
+		      "        (const_int 0)\n"
+		      "        (const_int 1)\n"
+		      "        (const_int 0)\n"
+		      "    ])",
+		      p);
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -295,6 +318,7 @@ rtl_tests_c_tests ()
   test_single_set ();
   test_uncond_jump ();
   const_poly_int_tests<NUM_POLY_INT_COEFFS>::run ();
+  test_dumping_repeat ();
 
   /* Purge state.  */
   set_first_insn (NULL);
diff --git a/gcc/testsuite/selftests/repeat.rtl b/gcc/testsuite/selftests/repeat.rtl
new file mode 100644
index 0000000..5507d33
--- /dev/null
+++ b/gcc/testsuite/selftests/repeat.rtl
@@ -0,0 +1,11 @@
+(function "repeat_examples"
+  (insn-chain
+    (block 2
+      (edge-from entry (flags "FALLTHRU"))
+      (cinsn 1
+        (parallel [(const_int 0) repeated x64])
+        "test.c":2 (nil))
+      (edge-to exit (flags "FALLTHRU"))
+    ) ;; block 2
+  ) ;; insn-chain
+) ;; function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME.
  2018-09-19 15:23                 ` Julian Brown
@ 2018-09-20 12:36                   ` Richard Biener
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-09-20 12:36 UTC (permalink / raw)
  To: Julian Brown; +Cc: Stubbs, Andrew, Jeff Law, Jan Hubicka, GCC Patches

On Wed, Sep 19, 2018 at 5:11 PM Julian Brown <julian@codesourcery.com> wrote:
>
> On Fri, 14 Sep 2018 22:49:35 -0400
> Julian Brown <julian@codesourcery.com> wrote:
>
> > > > On 12/09/18 16:16, Richard Biener wrote:
> > > > It may well be that there's a better way to solve the problem, or
> > > > at least to do the lookups.
> > > >
> > > > It may also be that there are some unintended consequences, such
> > > > as false name matches, but I don't know of any at present.
>
> > > Possibly, this was an abuse of these hooks, but it's arguably wrong
> > > that that e.g. handle_alias_pairs has the "assembler name" leak
> > > through into the user's source code -- if it's expected that the
> > > hook could make arbitrary transformations to the string. (The
> > > latter hook is only used by PE code for x86 at present, by the look
> > > of it, and the default handles only special-purpose mangling
> > > indicated by placing a '*' at the front of the symbol.)
>
> Two places I've found that currently expose the underlying symbol name
> in the user's source code: one (documented!) is C++, where one must
> write the mangled symbol name as the alias target:
>
> int foo (int c) { ... }
> int bar (int) __attribute__((alias("_Z3fooi")));
>
> another (perhaps obscure) is x86/PE with "fastcall":
>
> __attribute__((fastcall)) void foo(void) { ... }
> void bar(void) __attribute__((alias("@foo@0")));
>
> both of which probably suggest that using the decl name, rather than
> demangling the assembler name (or using some completely different
> solution) was the wrong thing to do.
>
> I'll keep thinking about this...

Thanks, IIRC we already have some targets with quite complex renaming
where I wonder if uses like the above work correctly.

Btw, if you don't "fix" the handle_alias_paris code but keep your mangling
what does break for you in practice (apart from maybe some testcases)?

Richard.

> Julian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 16/25] Fix IRA ICE.
  2018-09-18 22:00     ` Andrew Stubbs
@ 2018-09-20 12:47       ` Richard Sandiford
  2018-09-20 13:36         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-20 12:47 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 10:22, Richard Sandiford wrote:
>> <ams@codesourcery.com> writes:
>>> The IRA pass makes an assumption that any pseudos created after the
>>> pass begins
>>> were created explicitly by the pass itself and therefore will have
>>> corresponding entries in its other tables.
>>>
>>> The GCN back-end, however, often creates additional pseudos, in expand
>>> patterns, to represent the necessary EXEC value, and these break IRA's
>>> assumption and cause ICEs.
>>>
>>> This patch simply has IRA skip unknown pseudos, and the problem goes away.
>>>
>>> Presumably, it's not ideal that these registers have not been
>>> processed by IRA,
>>> but it does not appear to do any real harm.
>> 
>> Could you go into more detail about how this happens?  Other targets
>> also create pseudos in their move patterns.
>
> Here's a simplified snippet from the machine description:
>
> (define_expand "mov<mode>"
>   [(set (match_operand:VEC_REG_MODE 0 "nonimmediate_operand")
>         (match_operand:VEC_REG_MODE 1 "general_operand"))]
>   ""
>   {
>     [...]
>
>     if (can_create_pseudo_p ())
>       {
>         rtx exec = gcn_full_exec_reg ();
>         rtx undef = gcn_gen_undef (<MODE>mode);
>
>         [...]
>
>         emit_insn (gen_mov<mode>_vector (operands[0], operands[1], exec,
>                                          undef));
>         [...]
>
>         DONE;
>       }
>   })
>
> gcn_full_exec_reg creates a new pseudo. It gets used as the mask 
> parameter of a vec_merge.
>
> These registers then trip the asserts in ira.c.
>
> In the case of setup_preferred_alternate_classes_for_new_pseudos it's 
> because they have numbers greater than "start" but have not been 
> initialized with different ORIGINAL_REGNO (why would they have been?)
>
> In the case of move_unallocated_pseudos it's because the table 
> pseudo_replaced_reg only has entries for the new pseudos directly 
> created by find_moveable_pseudos, not the ones created indirectly.

What I more meant was: where do the moves that introduce the new
pseudos get created?

Almost all targets' move patterns introduce new pseudos if
can_create_pseudo_p in certain circumstances, so GCN isn't doing
anything unusual in the outline above.  I think it comes down to
the specifics of which kinds of operands require these temporaries
and where the moves are being introduced.

AIUI IRA normally calls expand_reg_info () at a suitable point
to cope with new pseudos.  It sounds like we might be missing
a call somewhere.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-17 12:46         ` Andrew Stubbs
@ 2018-09-20 13:01           ` Richard Biener
  2018-09-20 13:51             ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-20 13:01 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches, Richard Sandiford

On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 17/09/18 12:43, Richard Sandiford wrote:
> > OK, sounds like the cost of vec_construct is too low then.  But looking
> > at the port, I see you have:
> >
> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> >
> > int
> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> > {
> >    /* Always vectorize.  */
> >    return 1;
> > }
> >
> > which short-circuits the cost-model altogether.  Isn't that part
> > of the problem?
>
> Well, it's possible that that's a little simplistic. ;-)
>
> Although, actually the elementwise issue predates the existence of
> gcn_vectorization_cost, and the default does appear to penalize
> vec_construct somewhat.
>
> Actually, the default definition doesn't seem to do much besides
> increase vec_construct, so I'm not sure now why I needed to change it?
> Hmm, more experiments to do.
>
> Thanks for the pointer.

Btw, we do not consider using gather/scatter for VMAT_ELEMENTWISE;
that's a missed "optimization" quite possibly because gather/scatter is so
expensive on x86.  Thus the vectorizer should consider this and use the
cheaper alternative according to the cost model (which you of course should
fill with sensible values...).

Richard.

> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
@ 2018-09-20 13:09   ` Richard Biener
  2018-09-22 19:22   ` Andreas Schwab
  2019-09-03 14:01   ` [PATCH 02/25] Propagate address spaces to builtins Kyrill Tkachov
  2 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-09-20 13:09 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches

On Wed, Sep 5, 2018 at 1:50 PM <ams@codesourcery.com> wrote:
>
>
> At present, pointers passed to builtin functions, including atomic operators,
> are stripped of their address space properties.  This doesn't seem to be
> deliberate, it just omits to copy them.
>
> Not only that, but it forces pointer sizes to Pmode, which isn't appropriate
> for all address spaces.
>
> This patch attempts to correct both issues.  It works for GCN atomics and
> GCN OpenACC gang-private variables.

OK.

Richard.

> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>             Julian Brown  <julian@codesourcery.com>
>
>         gcc/
>         * builtins.c (get_builtin_sync_mem): Handle address spaces.
> ---
>  gcc/builtins.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 16/25] Fix IRA ICE.
  2018-09-20 12:47       ` Richard Sandiford
@ 2018-09-20 13:36         ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-20 13:36 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 20/09/18 13:46, Richard Sandiford wrote:
> Andrew Stubbs <ams@codesourcery.com> writes:
>> In the case of move_unallocated_pseudos it's because the table
>> pseudo_replaced_reg only has entries for the new pseudos directly
>> created by find_moveable_pseudos, not the ones created indirectly.
> 
> What I more meant was: where do the moves that introduce the new
> pseudos get created?

For find_moveable_pseudos, I believe it's where it calls gen_move_insn.

> Almost all targets' move patterns introduce new pseudos if
> can_create_pseudo_p in certain circumstances, so GCN isn't doing
> anything unusual in the outline above.  I think it comes down to
> the specifics of which kinds of operands require these temporaries
> and where the moves are being introduced.

GCN creates new pseudos for all vector moves.  Maybe that's just less 
exotic than what other targets do?

> AIUI IRA normally calls expand_reg_info () at a suitable point
> to cope with new pseudos.  It sounds like we might be missing
> a call somewhere.

Yes, it does, but one of the places I had to patch is *within* 
expand_reg_info: it's setup_preferred_alternate_classes_for_new_pseudos 
that asserts for pseudos created by gen_move_insn.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-20 13:01           ` Richard Biener
@ 2018-09-20 13:51             ` Richard Sandiford
  2018-09-20 14:14               ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-20 13:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: Stubbs, Andrew, GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 17/09/18 12:43, Richard Sandiford wrote:
>> > OK, sounds like the cost of vec_construct is too low then.  But looking
>> > at the port, I see you have:
>> >
>> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
>> >
>> > int
>> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
>> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
>> > {
>> >    /* Always vectorize.  */
>> >    return 1;
>> > }
>> >
>> > which short-circuits the cost-model altogether.  Isn't that part
>> > of the problem?
>>
>> Well, it's possible that that's a little simplistic. ;-)
>>
>> Although, actually the elementwise issue predates the existence of
>> gcn_vectorization_cost, and the default does appear to penalize
>> vec_construct somewhat.
>>
>> Actually, the default definition doesn't seem to do much besides
>> increase vec_construct, so I'm not sure now why I needed to change it?
>> Hmm, more experiments to do.
>>
>> Thanks for the pointer.
>
> Btw, we do not consider using gather/scatter for VMAT_ELEMENTWISE;
> that's a missed "optimization" quite possibly because gather/scatter is so
> expensive on x86.  Thus the vectorizer should consider this and use the
> cheaper alternative according to the cost model (which you of course should
> fill with sensible values...).

Do you mean it this way round, or that it doesn't consider using
VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
but that relies on implementing the new optabs instead of using the old
built-in-based interface, so it doesn't work for x86 yet.

I guess we might need some way of selecting between the two if
the costs of gather and scatter are context-dependent in some way.
But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
for certain modes then it's probably better not to define the optabs
for those modes.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-20 13:51             ` Richard Sandiford
@ 2018-09-20 14:14               ` Richard Biener
  2018-09-20 14:22                 ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-20 14:14 UTC (permalink / raw)
  To: Stubbs, Andrew, GCC Patches, Richard Sandiford

On Thu, Sep 20, 2018 at 3:40 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
> >> On 17/09/18 12:43, Richard Sandiford wrote:
> >> > OK, sounds like the cost of vec_construct is too low then.  But looking
> >> > at the port, I see you have:
> >> >
> >> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> >> >
> >> > int
> >> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> >> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> >> > {
> >> >    /* Always vectorize.  */
> >> >    return 1;
> >> > }
> >> >
> >> > which short-circuits the cost-model altogether.  Isn't that part
> >> > of the problem?
> >>
> >> Well, it's possible that that's a little simplistic. ;-)
> >>
> >> Although, actually the elementwise issue predates the existence of
> >> gcn_vectorization_cost, and the default does appear to penalize
> >> vec_construct somewhat.
> >>
> >> Actually, the default definition doesn't seem to do much besides
> >> increase vec_construct, so I'm not sure now why I needed to change it?
> >> Hmm, more experiments to do.
> >>
> >> Thanks for the pointer.
> >
> > Btw, we do not consider using gather/scatter for VMAT_ELEMENTWISE,
> > that's a missed "optimization" quite possibly because gather/scatter is so
> > expensive on x86.  Thus the vectorizer should consider this and use the
> > cheaper alternative according to the cost model (which you of course should
> > fill with sensible values...).
>
> Do you mean it this way round, or that it doesn't consider using
> VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
> VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
> but that relies on implementing the new optabs instead of using the old
> built-in-based interface, so it doesn't work for x86 yet.
>
> I guess we might need some way of selecting between the two if
> the costs of gather and scatter are context-dependent in some way.
> But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
> for certain modes then it's probably better not to define the optabs
> for those modes.

I think we can't vectorize true gathers (indexed from memory loads) w/o
gather yet, right?  So I really was thinking of implementing VMAT_ELEMENTWISE
(invariant stride) and VMAT_STRIDED_SLP by composing the appropriate
index vector with a splat and multiplication and using a gather.  I think that's
not yet implemented?
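
A minimal scalar sketch of that composition (illustrative only: VF, the
element type, and the function name are assumptions, not vectorizer code):

#define VF 4

static void
gather_from_stride (const double *base, long stride, double out[VF])
{
  long idx[VF];
  for (int i = 0; i < VF; i++)
    idx[i] = i;                /* step vector {0, 1, ..., VF-1} */
  for (int i = 0; i < VF; i++)
    idx[i] *= stride;          /* multiply by the splatted stride */
  for (int i = 0; i < VF; i++)
    out[i] = base[idx[i]];     /* the gather itself */
}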

But yes, vectorizing gathers as detected by dataref analysis w/o native gather
support would also be interesting.  We can do that by doing elementwise
loads and either loading the indexes elementwise as well or decomposing
the vector of indexes (depending on how that vector is computed).

Richard.

>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores.
  2018-09-20 14:14               ` Richard Biener
@ 2018-09-20 14:22                 ` Richard Sandiford
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Sandiford @ 2018-09-20 14:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: Stubbs, Andrew, GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Sep 20, 2018 at 3:40 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>> >> On 17/09/18 12:43, Richard Sandiford wrote:
>> >> > OK, sounds like the cost of vec_construct is too low then.  But looking
>> >> > at the port, I see you have:
>> >> >
>> >> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
>> >> >
>> >> > int
>> >> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
>> >> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
>> >> > {
>> >> >    /* Always vectorize.  */
>> >> >    return 1;
>> >> > }
>> >> >
>> >> > which short-circuits the cost-model altogether.  Isn't that part
>> >> > of the problem?
>> >>
>> >> Well, it's possible that that's a little simplistic. ;-)
>> >>
>> >> Although, actually the elementwise issue predates the existence of
>> >> gcn_vectorization_cost, and the default does appear to penalize
>> >> vec_construct somewhat.
>> >>
>> >> Actually, the default definition doesn't seem to do much besides
>> >> increase vec_construct, so I'm not sure now why I needed to change it?
>> >> Hmm, more experiments to do.
>> >>
>> >> Thanks for the pointer.
>> >
>> > Btw, we do not consider using gather/scatter for VMAT_ELEMENTWISE,
>> > that's a missed "optimization" quite possibly because gather/scatter is so
>> > expensive on x86.  Thus the vectorizer should consider this and use the
>> > cheaper alternative according to the cost model (which you of course should
>> > fill with sensible values...).
>>
>> Do you mean it this way round, or that it doesn't consider using
>> VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
>> VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
>> but that relies on implementing the new optabs instead of using the old
>> built-in-based interface, so it doesn't work for x86 yet.
>>
>> I guess we might need some way of selecting between the two if
>> the costs of gather and scatter are context-dependent in some way.
>> But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
>> for certain modes then it's probably better not to define the optabs
>> for those modes.
>
> I think we can't vectorize true gathers (indexed from memory loads) w/o
> gather yet, right?

Right.

> So I really was thinking of implementing VMAT_ELEMENTWISE (invariant
> stride) and VMAT_STRIDED_SLP by composing the appropriate index vector
> with a splat and multiplication and using a gather.  I think that's
> not yet implemented?

For SVE we use:

      /* As a last resort, trying using a gather load or scatter store.

	 ??? Although the code can handle all group sizes correctly,
	 it probably isn't a win to use separate strided accesses based
	 on nearby locations.  Or, even if it's a win over scalar code,
	 it might not be a win over vectorizing at a lower VF, if that
	 allows us to use contiguous accesses.  */
      if (*memory_access_type == VMAT_ELEMENTWISE
	  && single_element_p
	  && loop_vinfo
	  && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
						 masked_p, gs_info))
	*memory_access_type = VMAT_GATHER_SCATTER;

in get_group_load_store_type.  This only works when the target defines
gather/scatter using optabs rather than built-ins.

But yeah, no VMAT_STRIDED_SLP support yet.  That would be good
to have...

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-17  9:08   ` Richard Sandiford
@ 2018-09-20 15:44     ` Andrew Stubbs
  2018-09-26 16:26       ` Andrew Stubbs
  2018-09-26 16:50       ` Richard Sandiford
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-20 15:44 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 498 bytes --]

On 17/09/18 10:05, Richard Sandiford wrote:
> Would be good to have self-tests for the new transforms.
[...]
> known_eq, since we require equality for correctness.  Same for the
> other tests.

How about the attached? I've made the edits you requested and written 
some self-tests.

> Doesn't simplify_merge_mask make the second two redundant?  I couldn't
> see the difference between them and the first condition tested by
> simplify_merge_mask.

Yes, I think you're right. Removed, now.

Andrew


[-- Attachment #2: 180920-Simplify-vec_merge-according-to-the-mask.patch --]
[-- Type: text/x-patch, Size: 6618 bytes --]

Simplify vec_merge according to the mask.

This patch was part of the original patch we acquired from Honza and Martin.

It simplifies nested vec_merge operations that use the same mask.

Self-tests are included.
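
To illustrate the transformation (an editorial sketch in terms of the
self-test registers defined below, not additional patch code): with
vm1 = (vec_merge op0 op1 mask1) and vm2 = (vec_merge op2 op3 mask1),

  rtx nested = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
  rtx simplified = simplify_rtx (nested);
  /* simplified is (vec_merge op0 op3 mask1): under the shared mask,
     only op0 can ever be selected from vm1, and only op3 from vm2.  */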

2018-09-20  Andrew Stubbs  <ams@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	* simplify-rtx.c (simplify_merge_mask): New function.
	(simplify_ternary_operation): Use it, also see if VEC_MERGEs with the
	same masks are used in op1 or op2.
	(test_vec_merge): New function.
	(test_vector_ops): Call test_vec_merge.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index f77e1aa..13b2882 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5578,6 +5578,68 @@ simplify_cond_clz_ctz (rtx x, rtx_code cmp_code, rtx true_val, rtx false_val)
   return NULL_RTX;
 }
 
+/* Try to simplify nested VEC_MERGE operations by comparing the masks.  The
+   nested operations need not use the same vector mode, but must have the same
+   number of elements.
+
+   X is operand number OP of a VEC_MERGE operation whose mask is MASK.
+   Returns NULL_RTX if no simplification is possible.  */
+
+rtx
+simplify_merge_mask (rtx x, rtx mask, int op)
+{
+  gcc_assert (VECTOR_MODE_P (GET_MODE (x)));
+  poly_uint64 nunits = GET_MODE_NUNITS (GET_MODE (x));
+  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
+    {
+      if (!side_effects_p (XEXP (x, 1 - op)))
+	return XEXP (x, op);
+    }
+  if (side_effects_p (x))
+    return NULL_RTX;
+  if (UNARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      if (top0)
+	return simplify_gen_unary (GET_CODE (x), GET_MODE (x), top0,
+				   GET_MODE (XEXP (x, 0)));
+    }
+  if (BINARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      if (top0 || top1)
+	return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
+				    top0 ? top0 : XEXP (x, 0),
+				    top1 ? top1 : XEXP (x, 1));
+    }
+  if (GET_RTX_CLASS (GET_CODE (x)) == RTX_TERNARY
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 2)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 2))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      rtx top2 = simplify_merge_mask (XEXP (x, 2), mask, op);
+      if (top0 || top1 || top2)
+	return simplify_gen_ternary (GET_CODE (x), GET_MODE (x),
+				     GET_MODE (XEXP (x, 0)),
+				     top0 ? top0 : XEXP (x, 0),
+				     top1 ? top1 : XEXP (x, 1),
+				     top2 ? top2 : XEXP (x, 2));
+    }
+  return NULL_RTX;
+}
+
 \f
 /* Simplify CODE, an operation with result mode MODE and three operands,
    OP0, OP1, and OP2.  OP0_MODE was the mode of OP0 before it became
@@ -5967,6 +6029,16 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	  && !side_effects_p (op2) && !side_effects_p (op1))
 	return op0;
 
+      if (!side_effects_p (op2))
+	{
+	  rtx top0 = simplify_merge_mask (op0, op2, 0);
+	  rtx top1 = simplify_merge_mask (op1, op2, 1);
+	  if (top0 || top1)
+	    return simplify_gen_ternary (code, mode, mode,
+					 top0 ? top0 : op0,
+					 top1 ? top1 : op1, op2);
+	}
+
       break;
 
     default:
@@ -6932,6 +7004,71 @@ test_vector_ops_series (machine_mode mode, rtx scalar_reg)
 					    constm1_rtx));
 }
 
+/* Verify simplify_merge_mask works correctly.  */
+
+static void
+test_vec_merge (machine_mode mode)
+{
+  rtx op0 = make_test_reg (mode);
+  rtx op1 = make_test_reg (mode);
+  rtx op2 = make_test_reg (mode);
+  rtx op3 = make_test_reg (mode);
+  rtx op4 = make_test_reg (mode);
+  rtx op5 = make_test_reg (mode);
+  rtx mask1 = make_test_reg (SImode);
+  rtx mask2 = make_test_reg (SImode);
+  rtx vm1 = gen_rtx_VEC_MERGE (mode, op0, op1, mask1);
+  rtx vm2 = gen_rtx_VEC_MERGE (mode, op2, op3, mask1);
+  rtx vm3 = gen_rtx_VEC_MERGE (mode, op4, op5, mask1);
+
+  /* Simple vec_merge.  */
+  ASSERT_EQ (op0, simplify_merge_mask (vm1, mask1, 0));
+  ASSERT_EQ (op1, simplify_merge_mask (vm1, mask1, 1));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 0));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 1));
+
+  /* Nested vec_merge.  */
+  rtx nvm = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
+  ASSERT_EQ (vm1, simplify_merge_mask (nvm, mask1, 0));
+  ASSERT_EQ (vm2, simplify_merge_mask (nvm, mask1, 1));
+
+  /* Intermediate unary op.  */
+  rtx unop = gen_rtx_NOT (mode, vm1);
+  ASSERT_EQ (op0, XEXP (simplify_merge_mask (unop, mask1, 0), 0));
+  ASSERT_EQ (op1, XEXP (simplify_merge_mask (unop, mask1, 1), 0));
+
+  /* Intermediate binary op.  */
+  rtx binop = gen_rtx_PLUS (mode, vm1, vm2);
+  rtx res = simplify_merge_mask (binop, mask1, 0);
+  ASSERT_EQ (op0, XEXP (res, 0));
+  ASSERT_EQ (op2, XEXP (res, 1));
+  res = simplify_merge_mask (binop, mask1, 1);
+  ASSERT_EQ (op1, XEXP (res, 0));
+  ASSERT_EQ (op3, XEXP (res, 1));
+
+  /* Intermediate ternary op.  */
+  rtx tenop = gen_rtx_FMA (mode, vm1, vm2, vm3);
+  res = simplify_merge_mask (tenop, mask1, 0);
+  ASSERT_EQ (op0, XEXP (res, 0));
+  ASSERT_EQ (op2, XEXP (res, 1));
+  ASSERT_EQ (op4, XEXP (res, 2));
+  res = simplify_merge_mask (tenop, mask1, 1);
+  ASSERT_EQ (op1, XEXP (res, 0));
+  ASSERT_EQ (op3, XEXP (res, 1));
+  ASSERT_EQ (op5, XEXP (res, 2));
+
+  /* Side effects.  */
+  rtx badop0 = gen_rtx_PRE_INC (mode, op0);
+  rtx badvm = gen_rtx_VEC_MERGE (mode, badop0, op1, mask1);
+  ASSERT_EQ (badop0, simplify_merge_mask (badvm, mask1, 0));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (badvm, mask1, 1));
+
+  /* Called indirectly.  */
+  res = simplify_rtx (nvm);
+  ASSERT_EQ (op0, XEXP (res, 0));
+  ASSERT_EQ (op3, XEXP (res, 1));
+}
+
 /* Verify some simplifications involving vectors.  */
 
 static void
@@ -6947,6 +7084,7 @@ test_vector_ops ()
 	  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
 	      && maybe_gt (GET_MODE_NUNITS (mode), 2))
 	    test_vector_ops_series (mode, scalar_reg);
+	  test_vec_merge (mode);
 	}
     }
 }

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-19 16:38       ` Andrew Stubbs
  2018-09-19 22:27         ` Damian Rouson
@ 2018-09-20 15:59         ` Janne Blomqvist
  2018-09-20 16:37           ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-20 15:59 UTC (permalink / raw)
  To: ams; +Cc: Toon Moene, GCC Patches, Fortran List

On Wed, Sep 19, 2018 at 7:24 PM Andrew Stubbs <ams@codesourcery.com> wrote:

> On 05/09/18 19:07, Janne Blomqvist wrote:
> > The argument must be of type size_type_node, not sizetype. Please instead
> > use
> >
> > size = build_zero_cst (size_type_node);
> >
> >
> >>          * trans-intrinsic.c (conv_intrinsic_event_query): Convert
> computed
> >>          index to a size_t type.
> >>
> >
> > Using integer_type_node is wrong, but the correct type for calculating
> > array indices (lbound, ubound,  etc.) is not size_type_node but rather
> > gfc_array_index_type (which in practice maps to ptrdiff_t). So please use
> > that, and then fold_convert index to size_type_node just before
> generating
> > the call to event_query.
> >
> >
> >>          * trans-stmt.c (gfc_trans_event_post_wait): Likewise.
> >>
> >
> > Same here as above.
>
> How is the attached? I retested and found no regressions.
>
> Andrew
>

Ok, looks good.

There are some other remaining incorrect uses of integer_type_node (at
least one visible in the diff), but that can be done as a separate patch
(not saying you must do it as a precondition for anything, though it would
of course be nice if you would. :) )
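
For reference, a hedged sketch of the conversion pattern described in the
review quoted above (the variable names and the event-query argument list
are illustrative, not gfortran source):

/* Compute the index in gfc_array_index_type, then convert only at the
   library-call boundary.  */
tree index = fold_build2_loc (input_location, PLUS_EXPR,
                              gfc_array_index_type, lbound, offset);
index = fold_convert (size_type_node, index);
tree call = build_call_expr_loc (input_location,
                                 gfor_fndecl_caf_event_query, 5,
                                 token, index, image_index, count, stat);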

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 22/25] Add dg-require-effective-target exceptions
  2018-09-17 17:53   ` Mike Stump
@ 2018-09-20 16:10     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-20 16:10 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc-patches

On 17/09/18 18:51, Mike Stump wrote:
> On Sep 5, 2018, at 4:52 AM, ams@codesourcery.com wrote:
>> There are a number of tests that fail because they assume that exceptions are
>> available, but GCN does not support them, yet.
> 
> So, generally we don't goop up the testsuite with the day-to-day port
> stuff when it is being developed.  If the port is finished, and EH can't
> be done, this type of change is fine.  If someone plans on doing it in
> the next 5 years and the port is still being developed, there is likely
> little reason to do this.  People that track regressions do so by
> differencing, and that easily handles massive amounts of failures
> seamlessly.
> 
> So, my question would be: has it just not been worked on yet, or is it
> basically impossible to ever do it?

It's not impossible, but there's no plan to implement it.

I'm just trying to save myself and others from spending future hours
triaging this stuff again.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-20 15:59         ` [PATCH 08/25] Fix co-array allocation Janne Blomqvist
@ 2018-09-20 16:37           ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-20 16:37 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Toon Moene, GCC Patches, Fortran List

On 20/09/18 16:56, Janne Blomqvist wrote:
> Ok, looks good.

Thanks.

> There are some other remaining incorrect uses of integer_type_node (at 
> least one visible in the diff), but that can be done as a separate patch 
> (not saying you must do it as a precondition for anything, though it 
> would of course be nice if you would. :) )

I'm not confident I can tell what should be integer_type_node and what
should not.

Once it gets to build_call_expr_loc it's clear that the types should 
match the function signature, but the intermediate values' types are not 
obvious to me.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-19 22:27         ` Damian Rouson
  2018-09-19 22:55           ` Andrew Stubbs
@ 2018-09-20 20:49           ` Thomas Koenig
  2018-09-20 20:59             ` Damian Rouson
                               ` (2 more replies)
  1 sibling, 3 replies; 187+ messages in thread
From: Thomas Koenig @ 2018-09-20 20:49 UTC (permalink / raw)
  To: Damian Rouson, ams; +Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

Hi Damian,

> On a related note, two Sourcery Institute developers have attempted to edit
> the GCC build system to make the downloading and building of OpenCoarrays
> automatically part of the gfortran build process.  Neither developer
> succeeded.

We addressed integrating OpenCoarrays into the GCC source tree at the
recent GCC summit during the gfortran BoF session.

Feedback from people working for big Linux distributions was that they
would prefer to package OpenCoarrays as a separate library.
(They also mentioned it was quite hard to build.)

Maybe these people could use some help from you.

Regards

	Thomas

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-20 20:49           ` Thomas Koenig
@ 2018-09-20 20:59             ` Damian Rouson
  2018-09-21  7:38             ` Toon Moene
  2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
  2 siblings, 0 replies; 187+ messages in thread
From: Damian Rouson @ 2018-09-20 20:59 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: ams, Janne Blomqvist, Toon Moene, gcc patches, gfortran

On Thu, Sep 20, 2018 at 1:01 PM Thomas Koenig <tkoenig@netcologne.de> wrote:

>
> We addressed integrating OpenCoarray into the gcc source tree at the
> recent Gcc summit during the gfortran BoF session.
>

I agree with keeping it as a separate code base, but comments from some
gfortran developers on the gfortran mailing list suggest that they liked
the idea of integrating the building of OpenCoarrays into the GCC build
system to simplify multi-image testing.


> Feedback from people working for big Linux distributions was that they
> would prefer to package OpenCoarrays as a separate library.
> (They also mentioned it was quite hard to build.)
>
> Maybe these people could use some help from you.
>

Thanks for the feedback.  Please feel free to put me in touch with them or
suggest that they submit issues on the OpenCoarrays repository.  We would
be glad to help.  We've put a lot of time into addressing installation
issues that have been submitted to us and we'll continue to do so if we
receive reports.

Damian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-20 20:49           ` Thomas Koenig
  2018-09-20 20:59             ` Damian Rouson
@ 2018-09-21  7:38             ` Toon Moene
  2018-09-23 11:57               ` Janne Blomqvist
  2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
  2 siblings, 1 reply; 187+ messages in thread
From: Toon Moene @ 2018-09-21  7:38 UTC (permalink / raw)
  To: Thomas Koenig, Damian Rouson, ams; +Cc: Janne Blomqvist, gcc patches, gfortran

On 09/20/2018 10:01 PM, Thomas Koenig wrote:

> Hi Damian,
> 
>> On a related note, two Sourcery Institute developers have attempted to 
>> edit
>> the GCC build system to make the downloading and building of OpenCoarrays
>> automatically part of the gfortran build process.  Neither developer
>> succeeded.
> 
> We addressed integrating OpenCoarrays into the GCC source tree at the
> recent GCC summit during the gfortran BoF session.
> 
> Feedback from people working for big Linux distributions was that they
> would prefer to package OpenCoarrays as a separate library.
> (They also mentioned it was quite hard to build.)

Well, Linux distributors have to fit the build of OpenCoarrays into 
*their* build system, which might be just as complicated as us trying
to force it into *gcc's* build system ...

For an individual, OpenCoarrays is not hard to build, and the web page 
www.opencoarrays.org offers multiple solutions:

"Installation via package management is generally the easiest and most 
reliable option.   See below for the package-management installation 
options for Linux, macOS, and FreeBSD.  Alternatively, download and 
build the latest OpenCoarrays release  via the contained installation 
scripts or with CMake."

I chose the CMake-based one, because I already had CMake installed to
build ECMWF's (ecmwf.int) eccodes package. It probably helped that I
also already had openmpi installed. From my command history:

  1754  tar zxvf ~/Downloads/OpenCoarrays-2.2.0.tar.gz
  1755  cd OpenCoarrays-2.2.0/
  1756  ls
  1757  less README.md
  1758  cd ..
  1759  mkdir opencoarrays-build
  1760  cd opencoarrays-build
  1761  (export FC=gfortran; export CC=gcc; cmake ../OpenCoarrays-2.2.0/ -DCMAKE_INSTALL_PREFIX=$HOME/opencoarrays)
  1762  make
  1763  make test
  1764  make install

After that, it was a breeze to test my mock weather program
(moene.org/~toon/random-weather.f90), which until then I had built only
with -fcoarray=single.
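
[ For comparison, a single-image build like that needs no coarray
library at all, e.g.: gfortran -fcoarray=single random-weather.f90 ]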

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 187+ messages in thread

* OpenCoarrays integration with gfortran
  2018-09-20 20:49           ` Thomas Koenig
  2018-09-20 20:59             ` Damian Rouson
  2018-09-21  7:38             ` Toon Moene
@ 2018-09-21 16:37             ` Jerry DeLisle
  2018-09-21 19:37               ` Janne Blomqvist
                                 ` (2 more replies)
  2 siblings, 3 replies; 187+ messages in thread
From: Jerry DeLisle @ 2018-09-21 16:37 UTC (permalink / raw)
  To: Thomas Koenig, Damian Rouson, ams
  Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

My apologies for kidnapping this thread:
On 9/20/18 1:01 PM, Thomas Koenig wrote:
> Hi Damian,
> 
>> On a related note, two Sourcery Institute developers have attempted to 
>> edit
>> the GCC build system to make the downloading and building of OpenCoarrays
>> automatically part of the gfortran build process.  Neither developer
>> succeeded.
> 
> We addressed integrating OpenCoarrays into the GCC source tree at the
> recent GCC summit during the gfortran BoF session.
> 
> Feedback from people working for big Linux distributions was that they
> would prefer to package OpenCoarrays as a separate library.
> (They also mentioned it was quite hard to build.)

I would like to put in my humble 2 cents worth here.

OpenCoarrays was/is intended for a very broad audience, including various
large systems such as Cray, etc. I think this heavily influenced the path
of its development, which is certainly OK.

It was/is intended to interface libraries such as OpenMPI or MPICH to 
gfortran as well as other Fortran compilers.

The actual library source code is contained mostly in one source file.
After all the attempts to integrate into the GNU build systems without
much success, my thinking has shifted. Keep in mind that the OpenCoarrays
implementation is quite dependent on gfortran and in fact has to do
special things in the build depending on the version of gcc/gfortran a
user happens to use.  I don't think this is a good situation.

So I see two realistic strategies.  The first is already talked about a 
lot and is the cleanest approach for gfortran:

1) Focus on distribution packages such as Fedora, Debian, Ubuntu, 
Windows, etc. Building of these packages needs to be automated into the 
distributions. I think mostly this is what is happening and relies on 
the various distribution maintainers to do so.  Their support is greatly 
appreciated and this really is the cleanest approach.

The second option is not discussed as much because it leaves 
OpenCoarrays behind in a sense and requires an editing cycle in two 
places to fix bugs or add features.

2) Take the one source file, edit out all the macros that define 
prefixes to function calls, hard code the gfortran prefixes etc and fork 
it directly into the libgfortran library under GPL with attributions to 
the original developers as appropriate.
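
To make "edit out the macros" concrete, a hedged sketch (the macro name
and the exact signature are illustrative, not quoted from OpenCoarrays):

/* Before: entry points are spelled through a prefix macro so the same
   source can serve other compilers.  */
#define PREFIX(name) _gfortran_caf_##name
void PREFIX (sync_all) (int *stat, char *errmsg, int errmsg_len);

/* After hard-coding the gfortran prefix, the declaration would read:  */
void _gfortran_caf_sync_all (int *stat, char *errmsg, int errmsg_len);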

Strategy 2 would lock into specific current standard versions of the MPI 
interface and would support fewer bleeding-edge changes.  It would also 
require either OpenMPI or MPICH as a new gfortran dependency for 
building, which not all users may need. So we would need some 
configuration magic to enable or disable this portion of the build. 
Something like --with-MPI-support would do the trick.

Strategy 2 does add burden to gfortran maintainers who are already 
overloaded. But, as the code matures the burden would decrease, 
particularly once TEAMS are finished.

Strategy 2 does have some advantages. For example, it would eliminate the
need for separate CAF and CAFRUN scripts, which are a wrapper on gfortran.
The coarray features are part of the Fortran language, and gfortran
should just "handle it" transparently, using an environment variable to
define the number of images at run time. It would also actually
eliminate the need to manage all of the separate distribution packages.
So from a global point of view the overall maintenance effort would be
reduced.
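
A minimal sketch of the environment-variable idea (the variable name is
hypothetical):

#include <stdlib.h>

/* Pick the image count at run time; default to a single image.  */
static int
caf_num_images (void)
{
  const char *s = getenv ("GFORTRAN_NUM_IMAGES");  /* assumed name */
  int n = s ? atoi (s) : 1;
  return n > 0 ? n : 1;
}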

Strategy 2 would enable a set of users who are not focused so much on
distributions and loading packages, etc., and those who are dependent
on getting through bureaucratic administrations that already load
gfortran on systems, and so would not have to get another package
approved.  People would just have to stop thinking about it and use it.

So I think there are real advantages to Strategy 2 as well as Strategy 1 
and think it should be at least included in discussions. I would even 
suggest there is likely a combination of 1 and 2 that may hit the mark. 
For example, keeping OpenCoarrays as a separate package for bleeding 
edge development and migrating the stable features into libgfortran on a 
less frequent cycle.

As I said, my 2 cents worth.

Regards to all,

Jerry







^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
@ 2018-09-21 19:37               ` Janne Blomqvist
  2018-09-21 19:44               ` Richard Biener
  2018-09-21 20:25               ` Damian Rouson
  2 siblings, 0 replies; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-21 19:37 UTC (permalink / raw)
  To: jerry DeLisle
  Cc: Thomas Koenig, Damian Rouson, ams, Toon Moene, GCC Patches, Fortran List

On Fri, Sep 21, 2018 at 7:25 PM Jerry DeLisle <jvdelisle@charter.net> wrote:

> My apologies for kidnapping this thread:
> On 9/20/18 1:01 PM, Thomas Koenig wrote:
> > Hi Damian,
> >
> >> On a related note, two Sourcery Institute developers have attempted to
> >> edit
> >> the GCC build system to make the downloading and building of
> OpenCoarrays
> >> automatically part of the gfortran build process.  Neither developer
> >> succeeded.
> >
> > We addressed integrating OpenCoarrays into the GCC source tree at the
> > recent GCC summit during the gfortran BoF session.
> >
> > Feedback from people working for big Linux distributions was that they
> > would prefer to package OpenCoarrays as a separate library.
> > (They also mentioned it was quite hard to build.)
>
> I would like to put in my humble 2 cents worth here.
>
> OpenCoarrays was/is intended for a very broad audience, various large
> systems such as Cray, etc. I think this influenced heavily the path of
> its development, which is certainly OK.
>
> It was/is intended to interface libraries such as OpenMPI or MPICH to
> gfortran as well as other Fortran compilers.
>
> The actual library source code is contained mostly in one source file.
> After all the attempts to integrate into the GNU build systems without
> much success my thinking has shifted. Keep in mind that the OpenCoarrays
> implementation is quite dependent on gfortran and in fact has to do
> special things in the build dependent on the version of gcc/gfortran a
> user happens to use.  I don't think this is a good situation.
>
> So I see two realistic strategies.  The first is already talked about a
> lot and is the cleanest approach for gfortran:
>
> 1) Focus on distribution packages such as Fedora, Debian, Ubuntu,
> Windows, etc. Building of these packages needs to be automated into the
> distributions. I think mostly this is what is happening and relies on
> the various distribution maintainers to do so.  Their support is greatly
> appreciated and this really is the cleanest approach.
>
> The second option is not discussed as much because it leaves
> OpenCoarrays behind in a sense and requires an editing cycle in two
> places to fix bugs or add features.
>
> 2) Take the one source file, edit out all the macros that define
> prefixes to function calls, hard code the gfortran prefixes etc and fork
> it directly into the libgfortran library under GPL with attributions to
> the original developers as appropriate.
>
> Strategy 2 would lock into specific current standard versions of the MPI
> interface and would support less bleeding edge changes.  It would also
> require either OpenMPI or MPICH as a new gfortran dependency for
> building, which not all users may need. So we would need some
> configuration magic to enable or disable this portion of the build.
> Something like --with-MPI-support would do the trick.
>
> Strategy 2 does add burden to gfortran maintainers who are already
> overloaded. But, as the code matures the burden would decrease,
> particularly once TEAMS are finished.
>
> Strategy 2 does have some advantages. For example, eliminating the need
> for separate CAF and CAFRUN scripts which are a wrapper on gfortran.
> The coarray features are part of the Fortran language and gfortran
> should just "handle it" transparently using an environment variable to
> define the number of images at run time. It would also actually
> eliminate the need to manage all of the separate distribution packages.
> So from a global point of view the overall maintenance effort would be
> reduced.
>
> Strategy 2 would enable a set of users who are not focused so much on
> distributions and loading packages, etc., and those who are dependent
> on getting through bureaucratic administrations who already are loading
> gfortran on systems and would not have to also get another package
> approved.  People would just have to stop thinking about it and just use
> it.
>
> So I think there are real advantages to Strategy 2 as well as Strategy 1
> and think it should be at least included in discussions. I would even
> suggest there is likely a combination of 1 and 2 that may hit the mark.
> For example, keeping OpenCoarrays as a separate package for bleeding
> edge development and migrating the stable features into libgfortran on a
> less frequent cycle.
>
> As I said, my 2 cents worth.
>
> Regards to all,
>
> Jerry
>
>
I recall one motivation for the current sort-of loose coupling between the
coarray library and gfortran was to support, at runtime, different MPI
libraries.  This can be useful on clusters and supercomputers, where it's
important to use an MPI library that can use the high-performance cluster
network. If libgfortran includes the coarray library, which links against an
MPI library, it means libgfortran has to be rebuilt against every MPI
library in use on a system, and most likely, one cannot use the
distro-provided gfortran.  This might not be insurmountable on clusters
using some kind of module system, but still.

I guess it might be possible to use weak symbols, like we currently use for
some things in libgfortran (e.g. clock_gettime), but that would mean quite
a big diff compared to upstream OpenCoarrays. And we would have to handle
targets that don't support weak symbols in some sane fashion, etc.
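
A hedged sketch of the weak-symbol approach (the function names are
hypothetical, and it assumes a target that supports weak symbols):

/* Resolves to the real MPI-backed routine when a caf library is linked
   in; remains a null address otherwise.  */
extern void caf_sync_all (int *stat) __attribute__ ((weak));

void
sync_all_or_stub (int *stat)
{
  if (caf_sync_all)        /* real backend linked in */
    caf_sync_all (stat);
  else if (stat)
    *stat = 0;             /* single image: nothing to synchronize */
}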

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
  2018-09-21 19:37               ` Janne Blomqvist
@ 2018-09-21 19:44               ` Richard Biener
  2018-09-21 20:25               ` Damian Rouson
  2 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-09-21 19:44 UTC (permalink / raw)
  To: gcc-patches, Jerry DeLisle, Thomas Koenig, Damian Rouson, ams
  Cc: Janne Blomqvist, Toon Moene, gcc patches, gfortran

On September 21, 2018 6:24:45 PM GMT+02:00, Jerry DeLisle <jvdelisle@charter.net> wrote:
>My apologies for kidnapping this thread:
>On 9/20/18 1:01 PM, Thomas Koenig wrote:
>> Hi Damian,
>> 
>>> On a related note, two Sourcery Institute developers have attempted
>to 
>>> edit
>>> the GCC build system to make the downloading and building of
>OpenCoarrays
>>> automatically part of the gfortran build process.  Neither developer
>>> succeeded.
>> 
>> We addressed integrating OpenCoarrays into the GCC source tree at the
>> recent GCC summit during the gfortran BoF session.
>> 
>> Feedback from people working for big Linux distributions was that
>they
>> would prefer to package OpenCoarrays as a separate library.
>> (They also mentioned it was quite hard to build.)
>
>I would like to put in my humble 2 cents worth here.
>
>OpenCoarrays was/is intended for a very broad audience, various large 
>systems such as Cray, etc. I think this influenced heavily the path of 
>its development, which is certainly OK.
>
>It was/is intended to interface libraries such as OpenMPI or MPICH to 
>gfortran as well as other Fortran compilers.
>
>The actual library source code is contained mostly in one source file. 
>After all the attempts to integrate into the GNU build systems without 
>much success my thinking has shifted. Keep in mind that the
>OpenCoarrays 
>implementation is quite dependent on gfortran and in fact has to do 
>special things in the build dependent on the version of gcc/gfortran a 
>user happens to use.  I don't think this is a good situation.
>
>So I see two realistic strategies.  The first is already talked about a
>
>lot and is the cleanest approach for gfortran:
>
>1) Focus on distribution packages such as Fedora, Debian, Ubuntu, 
>Windows, etc. Building of these packages needs to be automated into the
>
>distributions. I think mostly this is what is happening and relies on 
>the various distribution maintainers to do so.  Their support is
>greatly 
>appreciated and this really is the cleanest approach.
>
>The second option is not discussed as much because it leaves 
>OpenCoarrays behind in a sense and requires an editing cycle in two 
>places to fix bugs or add features.
>
>2) Take the one source file, edit out all the macros that define 
>prefixes to function calls, hard code the gfortran prefixes etc and
>fork 
>it directly into the libgfortran library under GPL with attributions to
>
>the original developers as appropriate.
>
>Strategy 2 would lock into specific current standard versions of the
>MPI 
>interface and would support less bleeding edge changes.  It would also 
>require either OpenMPI or MPICH as a new gfortran dependency for 
>building, which not all users may need. So we would need some 
>configuration magic to enable or disable this portion of the build. 
>Something like --with-MPI-support would do the trick.
>
>Strategy 2 does add burden to gfortran maintainers who are already 
>overloaded. But, as the code matures the burden would decrease, 
>particularly once TEAMS are finished.
>
>Strategy 2 does have some advantages. For example, eliminating the need
>
>for separate CAF and CAFRUN scripts which are a wrapper on gfortran. 
>The coarray features are part of the Fortran language and gfortran 
>should just "handle it" transparently using an environment variable to 
>define the number of images at run time. It would also actually 
>eliminate the need to manage all of the separate distribution packages.
>
>So from a global point of view the overall maintenance effort would be 
>reduced.
>
>Strategy 2 would enable a set of users who are not focused so much on 
>distributions and loading packages, etc., and those who are dependent
>
>on getting through bureaucratic administrations who already are loading
>
>gfortran on systems and would not have to also get another package 
>approved.  People would just have to stop thinking about it and just
>use it.
>
>So I think there are real advantages to Strategy 2 as well as Strategy
>1 
>and think it should be at least included in discussions. I would even 
>suggest there is likely a combination of 1 and 2 that may hit the mark.
>
>For example, keeping OpenCoarrays as a separate package for bleeding 
>edge development and migrating the stable features into libgfortran on
>a 
>less frequent cycle.

Sounds reasonable to me. License issues will be the most difficult here,
given that integration with libgfortran likely requires an FSF copyright
rather than just a compatible license.

Richard. 

>As I said, my 2 cents worth.
>
>Regards to all,
>
>Jerry

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
  2018-09-21 19:37               ` Janne Blomqvist
  2018-09-21 19:44               ` Richard Biener
@ 2018-09-21 20:25               ` Damian Rouson
  2018-09-22  3:47                 ` Jerry DeLisle
  2 siblings, 1 reply; 187+ messages in thread
From: Damian Rouson @ 2018-09-21 20:25 UTC (permalink / raw)
  To: Jerry DeLisle
  Cc: Thomas Koenig, ams, Janne Blomqvist, Toon Moene, gcc patches, gfortran

On Fri, Sep 21, 2018 at 9:25 AM Jerry DeLisle <jvdelisle@charter.net> wrote:

> The actual library source code is contained mostly in one source file.

There are as many files as there are options for the underlying parallel
programming model.  The default is MPI, but I've co-authored conference
papers last year and this year in which the OpenCoarrays OpenSHMEM option
outperformed MPI.  One paper even described a platform on which OpenSHMEM
was the only option beyond a few thousand cores because the required MPI
features were immature on that platform.  Early versions of OpenCoarrays
also provided GASNet and ARMCI options.
I recommend against tying gfortran to MPI only.

> After all the attempts to integrate into the GNU build systems without
> much success my thinking has shifted.

Thanks for all your efforts!

> Keep in mind that the OpenCoarrays
> implementation is quite dependent on gfortran and in fact has to do
> special things in the build dependent on the version of gcc/gfortran a
> user happens to use.  I dont think this is a good situation.

I agree.  Possibly OpenCoarrays could drop support for older gfortran versions
at some point to avoid maintaining code that exists solely to support compiler
versions that are several years old.

>
> 1) Focus on distribution packages such as Fedora, Debian, Ubuntu,
> Windows, etc. Building of these packages needs to be automated into the
> distributions.

This is the option that the OpenCoarrays documentation recommends as easiest for
most users.

> 2) Take the one source file, edit out all the macros that define
> prefixes to function calls, hard code the gfortran prefixes etc and fork
> it directly into the libgfortran library under GPL with attributions to
> the original developers as appropriate.

See above.  Also, this means that changes in the gfortran repository would
not propagate back upstream unless each gfortran developer agrees to
distribute his or her work under both GPL and BSD.  Even that is only
feasible if the copied files stay cohesive and don't reference code outside
the copied file.  I think it's more likely that copying the code into
gfortran would be a branch point, after which the relevant files would
diverge and work on the GPL side would be harder to fund than the BSD side.

Most commercial entities are more likely to contribute to a BSD-licensed
project than a GPL-licensed one.  Over the past several months, one
commercial compiler vendor authorized one of their developers to contribute
to OpenCoarrays, and another commercial compiler vendor invited community
input on whether to use OpenCoarrays during a public teleconference.  The
prospect of commercial support is the motivation for using BSD.

> Strategy 2 does have some advantages. For example, eliminating the need
> for separate CAF and CAFRUN scripts which are a wrapper on gfortran.

Even in the case of just one underlying parallel programming model, this
is tricky.  To wit, Cray uses a compiler wrapper and a program launcher.
Intel was able to eliminate the compiler wrapper, but still required a
program launcher for distributed-memory execution until recently.  I don't
know the details, but I've heard it was not trivial for Intel to accomplish
this, and I imagine it would be even more complicated if they weren't
hardwiring Intel MPI into their back-end.

> People would just have to stop thinking about it and just use it.

The same would be true if someone could coax the GCC build system to build
OpenCoarrays just as it builds other prerequisites.  The big difference is
that OpenCoarrays is a prerequisite for using gfortran rather than for
building gfortran, so it needs to be built after gfortran rather than
before, like other prerequisites.  The real problem is finding anyone who
can work the proper magic in the GCC build system.

Thanks for your input.  I hope my response is helpful.

Damian

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-21 20:25               ` Damian Rouson
@ 2018-09-22  3:47                 ` Jerry DeLisle
  2018-09-23 10:41                   ` Toon Moene
  0 siblings, 1 reply; 187+ messages in thread
From: Jerry DeLisle @ 2018-09-22  3:47 UTC (permalink / raw)
  To: Damian Rouson
  Cc: Thomas Koenig, ams, Janne Blomqvist, Toon Moene, gcc patches, gfortran

On 9/21/18 1:16 PM, Damian Rouson wrote:
 > On Fri, Sep 21, 2018 at 9:25 AM Jerry DeLisle <jvdelisle@charter.net> wrote:
 >
 >> The actual library source code is contained mostly in one source file.
 >
 > There are as many files as there are options for the underlying
 > parallel programming
 > model.  The default is MPI, but I've co-authored conference papers 
last year
 > and this year in which the OpenCoarrays OpenSHMEM option outperformed MPI.
 > One paper even described a platform on which OpenSHMEM was the only 
option
 > beyond a few thousand cores because the required MPI features were 
immature on
 > that platform.  Early versions of OpenCoarrays also provided GASNet
 > and ARMCI options.
 > I recommend against tying gfortran to MPI only.

I agree with you on this point. Perhaps the OpenCoarrays implementation
should somehow do some runtime introspection to allow the library to
sync to whatever is desired on a given system. The gfortran interface
was designed to be generic. The implementation should be more dynamic,
using runtime linking and abstraction in such a way that OpenCoarrays
could be compiled stand-alone and use something like "plugins" to allow
the determination, after the build, of which interface to use.

I am by no means a software expert in these techniques, but they are
becoming common practice in other areas, for example Linux/GNU kernel
modules.

 >
 >> After all the attempts to integrate into the GNU build systems without
 >> much success my thinking has shifted.
 >
 > Thanks for all your efforts!
 >
 >> Keep in mind that the OpenCoarrays
 >> implementation is quite dependent on gfortran and in fact has to do
 >> special things in the build dependent on the version of gcc/gfortran a
 >> user happens to use.  I dont think this is a good situation.
 >
 > I agree.  Possibly OpenCoarrays could drop support for older gfortran 
versions
 > at some point to avoid maintaining code that exists solely to support 
compiler
 > versions that are several years old.

See my comments above about pluggable modules.  Maybe libgfortran should
have this pluggable interface and OpenCoarrays provide the plugins.
Think how useful it would be to be able to choose the backend at
execution time based on a simple environment variable set by the user.
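
A minimal sketch of such a plugin loader (the environment variable and
library names are hypothetical; this is not proposed gfortran code):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Load the coarray transport chosen by the user at execution time.  */
void *
load_caf_backend (void)
{
  const char *name = getenv ("GFORTRAN_CAF_BACKEND");  /* assumed name */
  if (!name)
    name = "libcaf_mpi.so";                /* assumed default backend */
  void *handle = dlopen (name, RTLD_NOW | RTLD_GLOBAL);
  if (!handle)
    {
      fprintf (stderr, "cannot load coarray backend %s: %s\n",
               name, dlerror ());
      exit (1);
    }
  return handle;
}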

 >
 >>
 >> 1) Focus on distribution packages such as Fedora, Debian, Ubuntu,
 >> Windows, etc. Building of these packages needs to be automated into the
 >> distributions.
 >
 > This is the option that the OpenCoarrays documentation recommends as 
easiest for
 > most users.

Agree.

 >
 >> 2) Take the one source file, edit out all the macros that define
 >> prefixes to function calls, hard code the gfortran prefixes etc and fork
 >> it directly into the libgfortran library under GPL with attributions to
 >> the original developers as appropriate.
 >
 > See above.   Also, this means that changes in the gfortran repository 
would not
 > propagate back upstream unless each gfortran developer agrees to
 > distribute his or her
 > work under both GPL and BSD.  Even that is only feasible if the copied
 > files stay cohesive

The flip side of this would be to have the OpenCoarrays developers agree
to the GPL and release under both. The libgfortran license says:

"Under Section 7 of GPL version 3, you are granted additional
permissions described in the GCC Runtime Library Exception, version
3.1, as published by the Free Software Foundation."

Probably worth a fresh look.

 > and don't reference code outside the copied file.  I think it's more
 > likely that copying the code
 > into gfortran would be a branch point, after which the relevant files
 > would diverge and
 > work on the GPL side would be harder to fund than the BSD side.
 >
 > Most commercial entities are more likely to contribute to a
 > BSD-licensed project than a
 > GPL-licensed one.  Over the past several months, one commercial 
compiler vendor
 > authorized one of their developers to contribute to OpenCoarrays. and
 > another commercial
 > compiler vendor invited community input on whether to use OpenCoarrays
 > during a public
 > teleconference.  The prospect of commercial support is the motivation
 > for using BSD.

I really have no commercial interest. So I will not comment on GPL vs 
BSD other than referring to the multitude of FSF recommendations about 
why one should choose one of the FSF flavors rather than BSD.

 >
 >> Strategy 2 does have some advantages. For example, eliminating the need
 >> for separate CAF and CAFRUN scripts which are a wrapper on gfortran.
 >
 > Even in the case of just one underlying parallel programming model,
 > this is tricky.  To wit, Cray uses
 > a compiler wrapper and a program launcher.  Intel was able to
 > eliminate the compiler wrapper,
 > but still required a program launcher for distributed-memory execution
 > until recently.  I don't
 > know the details, but I've heard it was not trivial for Intel to
 > accomplish this and I imagine it would be
 > even more complicated if they weren't hardwiring Intel MPI into their 
back-end.

Well, here is one commercial entity that did not shy away from
'hardwiring' MPI. Regardless, using plugins would resolve concerns about
which MPI to use, or whether to use shared memory or some other model.

 >
 >> People would just have to stop thinking about it and just use it.
 >
 > The same would be true if someone could coax the GCC build system to
 > build OpenCoarrays
 > just as it builds other prerequisites.  The big difference is that
 > OpenCoarrays is a prerequisite
 > for using gfortran rather than for building gfortran so it needs to be
 > built after gfortran rather
 > than before like other prerequisites.  The real problem is finding
 > anyone who can work the
 > proper magic in the GCC build system.

I don't see this as the real problem. The forking idea would resolve this 
fairly easily.  Above, you mentioned concern about locking into MPI. Do 
the packaged versions of OpenCoarrays not lock into MPI, either OpenMPI 
or MPICH? I have not tried one yet since I am waiting for the Fedora one 
to hit the release.

If the tight coupling is needed, maybe there ought to be a set of
libraries or modules, one for each "backend" (back to my pluggable
modules concept).  The more I think about it, the more I think this is
the fundamental design issue.

 >
 > Thanks for your input.  I hope my response is helpful.
 >
 > Damian
 >

As always, best regards.

Jerry



^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
  2018-09-20 13:09   ` Richard Biener
@ 2018-09-22 19:22   ` Andreas Schwab
  2018-09-24 16:53     ` Andrew Stubbs
  2018-09-25 14:27     ` [patch] Fix AArch64 ILP ICE Andrew Stubbs
  2019-09-03 14:01   ` [PATCH 02/25] Propagate address spaces to builtins Kyrill Tkachov
  2 siblings, 2 replies; 187+ messages in thread
From: Andreas Schwab @ 2018-09-22 19:22 UTC (permalink / raw)
  To: ams; +Cc: gcc-patches

On Sep 05 2018, <ams@codesourcery.com> wrote:

> At present, pointers passed to builtin functions, including atomic operators,
> are stripped of their address space properties.  This doesn't seem to be
> deliberate, it just omits to copy them.
>
> Not only that, but it forces pointer sizes to Pmode, which isn't appropriate
> for all address spaces.
>
> This patch attempts to correct both issues.  It works for GCN atomics and
> GCN OpenACC gang-private variables.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
>
> 	gcc/
> 	* builtins.c (get_builtin_sync_mem): Handle address spaces.

That breaks aarch64 ILP32.

../../../../libgomp/ordered.c: In function 'GOMP_doacross_wait':
../../../../libgomp/ordered.c:486:1: internal compiler error: output_operand: invalid address mode
486 | }
    | ^
0xf7219f aarch64_print_address_internal
        ../../gcc/config/aarch64/aarch64.c:7163
0xf7269b aarch64_print_operand_address
        ../../gcc/config/aarch64/aarch64.c:7267
0x8871ef output_address(machine_mode, rtx_def*)
        ../../gcc/final.c:4069
0xf7302b aarch64_print_operand
        ../../gcc/config/aarch64/aarch64.c:6952
0x887133 output_operand(rtx_def*, int)
        ../../gcc/final.c:4053
0x887b5b output_asm_insn(char const*, rtx_def**)
        ../../gcc/final.c:3965
0x889157 output_asm_insn(char const*, rtx_def**)
        ../../gcc/final.c:3842
0x889157 final_scan_insn_1
        ../../gcc/final.c:3103
0x88984b final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
        ../../gcc/final.c:3149
0x889b1f final_1
        ../../gcc/final.c:2019
0x88aaff rest_of_handle_final
        ../../gcc/final.c:4660
0x88aaff execute
        ../../gcc/final.c:4734

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-22  3:47                 ` Jerry DeLisle
@ 2018-09-23 10:41                   ` Toon Moene
  2018-09-23 18:03                     ` Bernhard Reutner-Fischer
  2018-09-24 11:14                     ` Alastair McKinstry
  0 siblings, 2 replies; 187+ messages in thread
From: Toon Moene @ 2018-09-23 10:41 UTC (permalink / raw)
  To: Jerry DeLisle, Damian Rouson
  Cc: Thomas Koenig, ams, Janne Blomqvist, gcc patches, gfortran

On 09/22/2018 01:23 AM, Jerry DeLisle wrote:

> On 9/21/18 1:16 PM, Damian Rouson wrote:
>  > On Fri, Sep 21, 2018 at 9:25 AM Jerry DeLisle <jvdelisle@charter.net> wrote:

>  >> 1) Focus on distribution packages such as Fedora, Debian, Ubuntu,
>  >> Windows, etc. Building of these packages needs to be automated into the
>  >> distributions.
>  >
>  > This is the option that the OpenCoarrays documentation recommends as 
> easiest for
>  > most users.
> 
> Agree.

I just installed OpenCoarrays on my system at home (Debian Testing):

root@moene:~# apt-get install libcoarrays-openmpi-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
   libcaf-openmpi-3
The following NEW packages will be installed:
   libcaf-openmpi-3 libcoarrays-openmpi-dev
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 107 kB of archives.
After this operation, 317 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://ftp.nl.debian.org/debian testing/main amd64 
libcaf-openmpi-3 amd64 2.2.0-3 [38.2 kB]
Get:2 http://ftp.nl.debian.org/debian testing/main amd64 
libcoarrays-openmpi-dev amd64 2.2.0-3 [68.9 kB]
Fetched 107 kB in 0s (634 kB/s)
Selecting previously unselected package libcaf-openmpi-3:amd64.
(Reading database ... 212249 files and directories currently installed.)
Preparing to unpack .../libcaf-openmpi-3_2.2.0-3_amd64.deb ...
Unpacking libcaf-openmpi-3:amd64 (2.2.0-3) ...
Selecting previously unselected package libcoarrays-openmpi-dev:amd64.
Preparing to unpack .../libcoarrays-openmpi-dev_2.2.0-3_amd64.deb ...
Unpacking libcoarrays-openmpi-dev:amd64 (2.2.0-3) ...
Setting up libcaf-openmpi-3:amd64 (2.2.0-3) ...
Setting up libcoarrays-openmpi-dev:amd64 (2.2.0-3) ...
Processing triggers for libc-bin (2.27-6) ...

[ previously this led to apt errors, but not now. ]

and moved my own installation of the OpenCoarrays-2.2.0.tar.gz out of 
the way:

toon@moene:~$ ls -ld *pen*
drwxr-xr-x 6 toon toon 4096 Aug 10 16:01 OpenCoarrays-2.2.0.opzij
drwxr-xr-x 8 toon toon 4096 Sep 15 11:26 opencoarrays-build.opzij
drwxr-xr-x 6 toon toon 4096 Sep 15 11:26 opencoarrays.opzij

and recompiled my stuff:

gfortran -g -fbacktrace -fcoarray=lib random-weather.f90 
-L/usr/lib/x86_64-linux-gnu/open-coarrays/openmpi/lib -lcaf_mpi

[ Yes, the location of the libs is quite experimental, but OK for the 
"Testing" variant of Debian ... ]

I couldn't find cafrun, but mpirun works just fine:

toon@moene:~/src$ echo ' &config /' | mpirun --oversubscribe --bind-to none -np 20 ./a.out
Decomposition information on image    7 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    6 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   11 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   15 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    1 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   13 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   12 is    4 *    5 slabs with   21 *   18 grid cells on this image.
Decomposition information on image   20 is    4 *    5 slabs with   21 *   18 grid cells on this image.
Decomposition information on image    9 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   14 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   16 is    4 *    5 slabs with   21 *   18 grid cells on this image.
Decomposition information on image   17 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   18 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    2 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    4 is    4 *    5 slabs with   21 *   18 grid cells on this image.
Decomposition information on image    5 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    3 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image    8 is    4 *    5 slabs with   21 *   18 grid cells on this image.
Decomposition information on image   10 is    4 *    5 slabs with   23 *   18 grid cells on this image.
Decomposition information on image   19 is    4 *    5 slabs with   23 *   18 grid cells on this image.

... etc. (see http://moene.org/~toon/random-weather.f90).

I presume other Linux distributors will follow shortly (this *is* Debian 
Testing, which can be a bit testy at times - but I do trust my main 
business at home on it for over 15 years now).

Kind regards,

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 08/25] Fix co-array allocation
  2018-09-21  7:38             ` Toon Moene
@ 2018-09-23 11:57               ` Janne Blomqvist
  0 siblings, 0 replies; 187+ messages in thread
From: Janne Blomqvist @ 2018-09-23 11:57 UTC (permalink / raw)
  To: Toon Moene; +Cc: Thomas Koenig, Damian Rouson, ams, GCC Patches, Fortran List

On Fri, Sep 21, 2018 at 10:33 AM Toon Moene <toon@moene.org> wrote:

> On 09/20/2018 10:01 PM, Thomas Koenig wrote:
>
> > Hi Damian,
> >
> >> On a related note, two Sourcery Institute developers have attempted to
> >> edit
> >> the GCC build system to make the downloading and building of
> OpenCoarrays
> >> automatically part of the gfortran build process.  Neither developer
> >> succeeded.
> >
> > We addressed integrating OpenCoarray into the gcc source tree at the
> > recent Gcc summit during the gfortran BoF session.
> >
> > Feedback from people working for big Linux distributions was that they
> > would prefer to package OpenCoarrays as a separate library.
> > (They also mentioned it was quite hard to build.)
>
> Well, Linux distributors have to fit the build of OpenCoarrays into
> *their* build system, which might be just as complicated as we trying it
> to force it into *gcc's* build system ...
>
> For an individual, OpenCoarrays is not hard to build, and the web page
> www.opencoarrays.org offers multiple solutions:
>
> "Installation via package management is generally the easiest and most
> reliable option.   See below for the package-management installation
> options for Linux, macOS, and FreeBSD.  Alternatively, download and
> build the latest OpenCoarrays release  via the contained installation
> scripts or with CMake."
>
> I choose the cmake based one, because I already had cmake installed to
> be able to build ECMWF's (ecmwf.int) eccodes package. It probably helped
> that I also already had openmpi installed. From my command history:
>
>   1754  tar zxvf ~/Downloads/OpenCoarrays-2.2.0.tar.gz
>   1755  cd OpenCoarrays-2.2.0/
>   1756  ls
>   1757  less README.md
>   1758  cd ..
>   1759  mkdir opencoarrays-build
>   1760  cd opencoarrays-build
>   1761  (export FC=gfortran; export CC=gcc; cmake ../OpenCoarrays-2.2.0/
> -DCMAKE_INSTALL_PREFIX=$HOME/opencoarrays)
>   1762  make
>   1763  make test
>   1764  make install
>

FWIW, this didn't work for me, as I want to use my own build of gfortran
trunk. It did correctly use the correct gfortran binary as specified by the
FC env. variable, but it still insists on linking against libgfortran.so.4
(installed by the system package manager) and not the libgfortran.so.5 from
my own gfortran installation (found both on LD_RUN_PATH and
LD_LIBRARY_PATH).  I tried -DCMAKE_PREFIX_PATH=... but that didn't work any
better. Gah, I hate cmake..

Any ideas?

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-23 10:41                   ` Toon Moene
@ 2018-09-23 18:03                     ` Bernhard Reutner-Fischer
  2018-09-24 11:14                     ` Alastair McKinstry
  1 sibling, 0 replies; 187+ messages in thread
From: Bernhard Reutner-Fischer @ 2018-09-23 18:03 UTC (permalink / raw)
  To: gcc-patches, Toon Moene, Jerry DeLisle, Damian Rouson
  Cc: Thomas Koenig, ams, Janne Blomqvist, gcc patches, gfortran

On 23 September 2018 11:46:57 CEST, Toon Moene <toon@moene.org> wrote:
>I just installed opencoarrays on my system at home (Debian Testing):
>...
>and recompiled my stuff:
>
>gfortran -g -fbacktrace -fcoarray=lib random-weather.f90 
>-L/usr/lib/x86_64-linux-gnu/open-coarrays/openmpi/lib -lcaf_mpi
>
>[ Yes, the location of the libs is quite experimental, but OK for the 
>"Testing" variant of Debian ... ]

Are you sure you need the -L?
For me a simple  -fcoarray=lib -lcaf_mpi
links fine.
Along the same lines a simple
$ mpirun -np 4 ./a.out
runs fine as expected, like any other mpi program.

Cheers,
>...

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-23 10:41                   ` Toon Moene
  2018-09-23 18:03                     ` Bernhard Reutner-Fischer
@ 2018-09-24 11:14                     ` Alastair McKinstry
  2018-09-27 12:51                       ` Richard Biener
  1 sibling, 1 reply; 187+ messages in thread
From: Alastair McKinstry @ 2018-09-24 11:14 UTC (permalink / raw)
  To: Toon Moene, Jerry DeLisle, Damian Rouson
  Cc: Thomas Koenig, ams, Janne Blomqvist, gcc patches, gfortran


On 23/09/2018 10:46, Toon Moene wrote:
> On 09/22/2018 01:23 AM, Jerry DeLisle wrote:
>
> I just installed opencoarrays on my system at home (Debian Testing):
>
> root@moene:~# apt-get install libcoarrays-openmpi-dev
> ...
> Setting up libcaf-openmpi-3:amd64 (2.2.0-3) ...
> Setting up libcoarrays-openmpi-dev:amd64 (2.2.0-3) ...
> Processing triggers for libc-bin (2.27-6) ...
>
> [ previously this led to apt errors, but not now. ]
>
> and moved my own installation of the OpenCoarrays-2.2.0.tar.gz out of 
> the way:
>
> toon@moene:~$ ls -ld *pen*
> drwxr-xr-x 6 toon toon 4096 Aug 10 16:01 OpenCoarrays-2.2.0.opzij
> drwxr-xr-x 8 toon toon 4096 Sep 15 11:26 opencoarrays-build.opzij
> drwxr-xr-x 6 toon toon 4096 Sep 15 11:26 opencoarrays.opzij
>
> and recompiled my stuff:
>
> gfortran -g -fbacktrace -fcoarray=lib random-weather.f90 
> -L/usr/lib/x86_64-linux-gnu/open-coarrays/openmpi/lib -lcaf_mpi
>
> [ Yes, the location of the libs is quite experimental, but OK for the 
> "Testing" variant of Debian ... ]
>
> I couldn't find cafrun, but mpirun works just fine:
>
> toon@moene:~/src$ echo ' &config /' | mpirun --oversubscribe --bind-to 
> none -np 20 ./a.out
>
> ... etc. (see http://moene.org/~toon/random-weather.f90).
>
> I presume other Linux distributors will follow shortly (this *is* 
> Debian Testing, which can be a bit testy at times - but I do trust my 
> main business at home on it for over 15 years now).
>
> Kind regards,
>
Thanks, good to see it being tested (I'm the Debian/Ubuntu packager).

caf / cafrun have been dropped (for the moment?) in favour of mpirun, 
but I've added pkg-config caf packages so that becomes an option.

    $ pkg-config caf-mpich --libs

    -L/usr/lib/x86_64-linux-gnu/open-coarrays/mpich/lib -lcaf_mpich -Wl,-z,relro -lmpich -lm -lbacktrace -lpthread -lrt

(My thinking is that for libraries in particular, the user need not know 
whether CAF is being used, and if lib foobar uses CAF, then adding a

     Requires: caf

line to its pkg-config file gives you the correct linking transparently.)
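
As a sketch of what that could look like (libfoobar and its paths here are 
hypothetical, not a real package), the .pc file might be:

    # libfoobar.pc -- hypothetical library that uses CAF internally
    prefix=/usr
    libdir=${prefix}/lib/x86_64-linux-gnu
    includedir=${prefix}/include

    Name: libfoobar
    Description: Example library built on co-arrays
    Version: 1.0
    Requires: caf
    Libs: -L${libdir} -lfoobar
    Cflags: -I${includedir}

so 'pkg-config libfoobar --libs' would pull in the CAF link flags via the 
Requires line, without the user ever naming caf explicitly.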

The "strange" paths are due to Debians multiarch : it is possible to 
include libraries for multiple architectures simultaneously. This works 
ok with pkg-config and cmake , etc (which allow you to set 
PKG_CONFIG_PATH and have multiple pkgconfig files for different libs 
simultaneously) , but currently break wrappers such as caf / cafrun.

I can add a new package for caf / cafrun but would rather not.  (We 
currently don't do non-MPI CAF builds.)

There are currently pkg-config files 'caf-mpich' and 'caf-openmpi' for 
testing, and I'm adding a default alias caf -> caf-$(default-MPI).

regards

Alastair




-- 
Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-22 19:22   ` Andreas Schwab
@ 2018-09-24 16:53     ` Andrew Stubbs
  2018-09-24 17:40       ` Andreas Schwab
  2018-09-25 14:27     ` [patch] Fix AArch64 ILP ICE Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-24 16:53 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: gcc-patches

On 22/09/18 19:51, Andreas Schwab wrote:
> That breaks aarch64 ILP32.

I'm struggling to reproduce this because apparently I don't know how to 
build aarch64 ILP32.

Presumably, in order to be building libgomp this must be an 
aarch64-linux-gnu toolchain, but when I set --with-abi=ilp32 I can't 
build glibc:

In file included from ../nptl/descr.h:24,
                  from ../sysdeps/aarch64/nptl/tls.h:44,
                  from ../sysdeps/unix/sysv/linux/aarch64/sysdep.h:29,
                  from <stdin>:1:
../include/setjmp.h:50:3: error: static assertion failed: "offset of __saved_mask field of struct __jmp_buf_tag != 184

What should I have done?

Thanks

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-24 16:53     ` Andrew Stubbs
@ 2018-09-24 17:40       ` Andreas Schwab
  0 siblings, 0 replies; 187+ messages in thread
From: Andreas Schwab @ 2018-09-24 17:40 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Sep 24 2018, Andrew Stubbs <ams@codesourcery.com> wrote:

> What should I have done?

Make sure you have the ILP32 patches for glibc and kernel.  You can get
them from the arm/ilp32 branch on sourceware.org
<http://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/arm/ilp32>
and the staging/ilp32-4.17 branch of the arm64 kernel tree
<https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=staging/ilp32-4.17>.

You can also get pre-built packages from
<https://download.opensuse.org/repositories/devel:/ARM:/Factory:/Contrib:/ILP32/standard/>
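
For example, one way to fetch them (the clone URLs here are inferred from
the links above):

    git clone -b arm/ilp32 https://sourceware.org/git/glibc.git
    git clone -b staging/ilp32-4.17 \
        https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git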

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 187+ messages in thread

* [patch] Fix AArch64 ILP ICE
  2018-09-22 19:22   ` Andreas Schwab
  2018-09-24 16:53     ` Andrew Stubbs
@ 2018-09-25 14:27     ` Andrew Stubbs
  2018-09-26  8:55       ` Andreas Schwab
  2018-09-26 13:39       ` Richard Biener
  1 sibling, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-25 14:27 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 271 bytes --]

On 22/09/18 19:51, Andreas Schwab wrote:
> That breaks aarch64 ILP32.

The problem is that the mode given to expand_expr is just a "hint", 
apparently, and it's being ignored.

I'm testing the attached patch for GCN. It fixes the ICE for AArch64 
just fine.
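
Concretely (pseudo register numbers invented for illustration): under ILP32
pointers are SImode but the address mode is DImode, and expand_expr treats
the requested mode only as a hint, so the address can still come back as,
say, (reg:SI 100).  Using that directly in gen_rtx_MEM is what triggers the
ICE; the convert_memory_address call below widens it first, e.g. to
(zero_extend:DI (reg:SI 100)).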

OK?

Andrew

[-- Attachment #2: 180925-fix-aarch64-ice.patch --]
[-- Type: text/x-patch, Size: 752 bytes --]

Fix AArch64 ILP32 ICE.

Ensure that the address really is the correct mode for an address.

2018-09-25  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* builtins.c (get_builtin_sync_mem): Force address mode conversion.

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1d4de09..956f872 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -5869,6 +5869,7 @@ get_builtin_sync_mem (tree loc, machine_mode mode)
   scalar_int_mode addr_mode = targetm.addr_space.address_mode (addr_space);
 
   addr = expand_expr (loc, NULL_RTX, addr_mode, EXPAND_SUM);
+  addr = convert_memory_address (addr_mode, addr);
 
   /* Note that we explicitly do not want any alias information for this
      memory, so that we kill all other live memories.  Otherwise we don't

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [patch] Fix AArch64 ILP ICE
  2018-09-25 14:27     ` [patch] Fix AArch64 ILP ICE Andrew Stubbs
@ 2018-09-26  8:55       ` Andreas Schwab
  2018-09-26 13:39       ` Richard Biener
  1 sibling, 0 replies; 187+ messages in thread
From: Andreas Schwab @ 2018-09-26  8:55 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

On Sep 25 2018, Andrew Stubbs <ams@codesourcery.com> wrote:

> Ensure that the address really is the correct mode for an address.
>
> 2018-09-25  Andrew Stubbs  <ams@codesourcery.com>
>
> 	gcc/
> 	* builtins.c (get_builtin_sync_mem): Force address mode conversion.

This has survived bootstrap so far.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [patch] Fix AArch64 ILP ICE
  2018-09-25 14:27     ` [patch] Fix AArch64 ILP ICE Andrew Stubbs
  2018-09-26  8:55       ` Andreas Schwab
@ 2018-09-26 13:39       ` Richard Biener
  2018-09-26 16:17         ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Richard Biener @ 2018-09-26 13:39 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: Andreas Schwab, GCC Patches

On Tue, Sep 25, 2018 at 4:25 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 22/09/18 19:51, Andreas Schwab wrote:
> > That breaks aarch64 ILP32.
>
> The problem is that the mode given to expand_expr is just a "hint",
> apparently, and it's being ignored.
>
> I'm testing the attached patch for GCN. It fixes the ICE for AArch64
> just fine.
>
> OK?

OK.

> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode
  2018-09-17  8:46   ` Richard Sandiford
@ 2018-09-26 15:52     ` Andrew Stubbs
  2018-09-26 16:49       ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-26 15:52 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2383 bytes --]

On 17/09/18 09:40, Richard Sandiford wrote:
> <ams@codesourcery.com> writes:
>> This is an update of the patch posted to PR82089 long ago.  We ran into the
>> same bug on GCN, so we need this fixed as part of this series.
>>
>> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>>              Tom de Vries  <tom@codesourcery.com>
>>
>> 	PR82089
>>
>> 	gcc/
>> 	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
>> 	STORE_FLAG_VALUE == 1.
>> ---
>>   gcc/expmed.c | 15 +++++++++++----
>>   1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/gcc/expmed.c b/gcc/expmed.c
>> index 29ce10b..0b87fdc 100644
>> --- a/gcc/expmed.c
>> +++ b/gcc/expmed.c
>> @@ -5464,11 +5464,18 @@ emit_cstore (rtx target, enum insn_code icode, enum rtx_code code,
>>        If STORE_FLAG_VALUE does not have the sign bit set when
>>        interpreted in MODE, we can do this conversion as unsigned, which
>>        is usually more efficient.  */
>> -  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode))
>> +  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode)
>> +      || (result_mode == BImode && int_target_mode != BImode))
> 
> Would be better to test GET_MODE_PRECISION instead of GET_MODE_SIZE,
> if that works, instead of treating BImode as a special case.
> 
>>       {
>> -      convert_move (target, subtarget,
>> -		    val_signbit_known_clear_p (result_mode,
>> -					       STORE_FLAG_VALUE));
>> +      gcc_assert (GET_MODE_SIZE (result_mode) != 1
>> +		  || STORE_FLAG_VALUE == 1 || STORE_FLAG_VALUE == -1);
>> +      bool unsignedp
>> +	= (GET_MODE_SIZE (result_mode) == 1
>> +	   ? STORE_FLAG_VALUE == 1
>> +	   : val_signbit_known_clear_p (result_mode, STORE_FLAG_VALUE));
>> +
>> +      convert_move (target, subtarget, unsignedp);
>> +
> 
> GET_MODE_SIZE == 1 would also trigger for QImode, which shouldn't be treated
> differently from HImode etc.
> 
> The original val_signbit_known_clear_p test seems like it might be an
> abstraction too far.  In practice STORE_FLAG_VALUE has to fit within
> the mode of a natural (unextended) condition result, so I think we can
> simply test STORE_FLAG_VALUE >= 0 for all modes to see whether the target
> wants the result to be treated as signed or unsigned.

How about the attached?

I think I addressed all your comments, and it tests fine on GCN with no 
regressions.
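
To illustrate the bug being fixed, here is a plain C model (not GCC
internals): in BImode the value 1 occupies the only bit, which is also the
sign bit, so the old val_signbit_known_clear_p test made convert_move
sign-extend a "true" result of 1 into -1 in the wider mode.

    #include <stdio.h>

    int main (void)
    {
      unsigned bi = 1;                  /* model of a BImode "true" result */
      int sign_extended = bi ? -1 : 0;  /* old behaviour: only bit = sign bit */
      int zero_extended = bi ? 1 : 0;   /* fixed behaviour, STORE_FLAG_VALUE == 1 */
      printf ("sign-extended: %d, zero-extended: %d\n",
              sign_extended, zero_extended);
      return 0;
    }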

Andrew

[-- Attachment #2: 180926-pr82089.patch --]
[-- Type: text/x-patch, Size: 1329 bytes --]

[pr82089] Don't sign-extend SFV 1 in BImode

This is an update of the patch posted to PR82089 long ago.  We ran into the
same bug on GCN, so we need this fixed as part of this series.

2018-09-26  Andrew Stubbs  <ams@codesourcery.com>
            Tom de Vries  <tom@codesourcery.com>

	PR82089

	gcc/
	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
	STORE_FLAG_VALUE == 1.

diff --git a/gcc/expmed.c b/gcc/expmed.c
index 29ce10b..444d6a8 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -5464,11 +5464,14 @@ emit_cstore (rtx target, enum insn_code icode, enum rtx_code code,
      If STORE_FLAG_VALUE does not have the sign bit set when
      interpreted in MODE, we can do this conversion as unsigned, which
      is usually more efficient.  */
-  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode))
+  if (GET_MODE_PRECISION (int_target_mode) > GET_MODE_PRECISION (result_mode))
     {
-      convert_move (target, subtarget,
-		    val_signbit_known_clear_p (result_mode,
-					       STORE_FLAG_VALUE));
+      gcc_assert (GET_MODE_PRECISION (result_mode) != 1
+		  || STORE_FLAG_VALUE == 1 || STORE_FLAG_VALUE == -1);
+
+      bool unsignedp = (STORE_FLAG_VALUE >= 0);
+      convert_move (target, subtarget, unsignedp);
+
       op0 = target;
       result_mode = int_target_mode;
     }

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [patch] Fix AArch64 ILP ICE
  2018-09-26 13:39       ` Richard Biener
@ 2018-09-26 16:17         ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-26 16:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andreas Schwab, GCC Patches

On 26/09/18 14:38, Richard Biener wrote:
> OK.

Committed, thanks.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-20 11:42       ` Andrew Stubbs
@ 2018-09-26 16:23         ` Andrew Stubbs
  2018-10-04 18:24         ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-26 16:23 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

Ping.

On 20/09/18 11:52, Andrew Stubbs wrote:
> On 19/09/18 17:38, Andrew Stubbs wrote:
>> Here's an updated patch incorporating the RTL front-end changes. I had 
>> to change from "repeated 2x" to "repeated x2" because the former is 
>> not a valid C token, and apparently that's important.
> 
> Here's a patch with self tests added, for both reading and writing.
> 
> It also fixes a bug when the repeat was the last item in a list.
> 
> OK?
> 
> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-20 15:44     ` Andrew Stubbs
@ 2018-09-26 16:26       ` Andrew Stubbs
  2018-09-26 16:50       ` Richard Sandiford
  1 sibling, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-26 16:26 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

Ping.

On 20/09/18 16:26, Andrew Stubbs wrote:
> On 17/09/18 10:05, Richard Sandiford wrote:
>> Would be good to have self-tests for the new transforms.
> [...]
>> known_eq, since we require equality for correctness.  Same for the
>> other tests.
> 
> How about the attached? I've made the edits you requested and written 
> some self-tests.
> 
>> Doesn't simplify_merge_mask make the second two redundant?  I couldn't
>> see the difference between them and the first condition tested by
>> simplify_merge_mask.
> 
> Yes, I think you're right. Removed, now.
> 
> Andrew
> 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode
  2018-09-26 15:52     ` Andrew Stubbs
@ 2018-09-26 16:49       ` Richard Sandiford
  2018-09-27 12:20         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-26 16:49 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 09:40, Richard Sandiford wrote:
>> <ams@codesourcery.com> writes:
>>> This is an update of the patch posted to PR82089 long ago.  We ran into the
>>> same bug on GCN, so we need this fixed as part of this series.
>>>
>>> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>>>              Tom de Vries  <tom@codesourcery.com>
>>>
>>> 	PR82089
>>>
>>> 	gcc/
>>> 	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
>>> 	STORE_FLAG_VALUE == 1.
>>> ---
>>>   gcc/expmed.c | 15 +++++++++++----
>>>   1 file changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/gcc/expmed.c b/gcc/expmed.c
>>> index 29ce10b..0b87fdc 100644
>>> --- a/gcc/expmed.c
>>> +++ b/gcc/expmed.c
>>> @@ -5464,11 +5464,18 @@ emit_cstore (rtx target, enum insn_code icode, enum rtx_code code,
>>>        If STORE_FLAG_VALUE does not have the sign bit set when
>>>        interpreted in MODE, we can do this conversion as unsigned, which
>>>        is usually more efficient.  */
>>> -  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode))
>>> +  if (GET_MODE_SIZE (int_target_mode) > GET_MODE_SIZE (result_mode)
>>> +      || (result_mode == BImode && int_target_mode != BImode))
>> 
>> Would be better to test GET_MODE_PRECISION instead of GET_MODE_SIZE,
>> if that works, instead of treating BImode as a special case.
>> 
>>>       {
>>> -      convert_move (target, subtarget,
>>> -		    val_signbit_known_clear_p (result_mode,
>>> -					       STORE_FLAG_VALUE));
>>> +      gcc_assert (GET_MODE_SIZE (result_mode) != 1
>>> +		  || STORE_FLAG_VALUE == 1 || STORE_FLAG_VALUE == -1);
>>> +      bool unsignedp
>>> +	= (GET_MODE_SIZE (result_mode) == 1
>>> +	   ? STORE_FLAG_VALUE == 1
>>> +	   : val_signbit_known_clear_p (result_mode, STORE_FLAG_VALUE));
>>> +
>>> +      convert_move (target, subtarget, unsignedp);
>>> +
>> 
>> GET_MODE_SIZE == 1 would also trigger for QImode, which shouldn't be treated
>> differently from HImode etc.
>> 
>> The original val_signbit_known_clear_p test seems like it might be an
>> abstraction too far.  In practice STORE_FLAG_VALUE has to fit within
>> the mode of a natural (unextended) condition result, so I think we can
>> simply test STORE_FLAG_VALUE >= 0 for all modes to see whether the target
>> wants the result to be treated as signed or unsigned.
>
> How about the attached?
>
> I think I addressed all your comments, and it tests fine on GCN with no 
> regressions.
>
> Andrew
>
> [pr82089] Don't sign-extend SFV 1 in BImode
>
> This is an update of the patch posted to PR82089 long ago.  We ran into the
> same bug on GCN, so we need this fixed as part of this series.
>
> 2018-09-26  Andrew Stubbs  <ams@codesourcery.com>
>             Tom de Vries  <tom@codesourcery.com>
>
> 	PR82089
>
> 	gcc/
> 	* expmed.c (emit_cstore): Fix handling of result_mode == BImode and
> 	STORE_FLAG_VALUE == 1.

OK, thanks.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-20 15:44     ` Andrew Stubbs
  2018-09-26 16:26       ` Andrew Stubbs
@ 2018-09-26 16:50       ` Richard Sandiford
  2018-09-26 17:06         ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-26 16:50 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 10:05, Richard Sandiford wrote:
>> Would be good to have self-tests for the new transforms.
> [...]
>> known_eq, since we require equality for correctness.  Same for the
>> other tests.
>
> How about the attached? I've made the edits you requested and written 
> some self-tests.
>
>> Doesn't simplify_merge_mask make the second two redundant?  I couldn't
>> see the difference between them and the first condition tested by
>> simplify_merge_mask.
>
> Yes, I think you're right. Removed, now.
>
> Andrew
>
> Simplify vec_merge according to the mask.
>
> This patch was part of the original patch we acquired from Honza and Martin.
>
> It simplifies nested vec_merge operations using the same mask.
>
> Self-tests are included.
>
> 2018-09-20  Andrew Stubbs  <ams@codesourcery.com>
> 	    Jan Hubicka  <jh@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
>
> 	* simplify-rtx.c (simplify_merge_mask): New function.
> 	(simplify_ternary_operation): Use it, also see if VEC_MERGEs with the
> 	same masks are used in op1 or op2.
> 	(test_vec_merge): New function.
> 	(test_vector_ops): Call test_vec_merge.
>
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index f77e1aa..13b2882 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5578,6 +5578,68 @@ simplify_cond_clz_ctz (rtx x, rtx_code cmp_code, rtx true_val, rtx false_val)
>    return NULL_RTX;
>  }
>  
> +/* Try to simplify nested VEC_MERGE operations by comparing the masks.  The
> +   nested operations need not use the same vector mode, but must have the same
> +   number of elements.
> +
> +   X is an operand number OP of a VEC_MERGE operation with MASK.
> +   Returns NULL_RTX if no simplification is possible.  */

X isn't always operand OP, it can be nested within it.  How about:

/* Try to simplify X given that it appears within operand OP of a
   VEC_MERGE operation whose mask is MASK.  X need not use the same
   vector mode as the VEC_MERGE, but it must have the same number of
   elements.

   Return the simplified X on success, otherwise return NULL_RTX.  */

> +
> +rtx
> +simplify_merge_mask (rtx x, rtx mask, int op)
> +{
> +  gcc_assert (VECTOR_MODE_P (GET_MODE (x)));
> +  poly_uint64 nunits = GET_MODE_NUNITS (GET_MODE (x));
> +  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
> +    {
> +      if (!side_effects_p (XEXP (x, 1 - op)))
> +	return XEXP (x, op);
> +    }
> +  if (side_effects_p (x))
> +    return NULL_RTX;
> +  if (UNARY_P (x)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits))
> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      if (top0)
> +	return simplify_gen_unary (GET_CODE (x), GET_MODE (x), top0,
> +				   GET_MODE (XEXP (x, 0)));
> +    }
> +  if (BINARY_P (x)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits))
> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
> +      if (top0 || top1)
> +	return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
> +				    top0 ? top0 : XEXP (x, 0),
> +				    top1 ? top1 : XEXP (x, 1));
> +    }
> +  if (GET_RTX_CLASS (GET_CODE (x)) == RTX_TERNARY
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits)
> +      && VECTOR_MODE_P (GET_MODE (XEXP (x, 2)))
> +      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 2))), nunits))
> +    {
> +      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
> +      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
> +      rtx top2 = simplify_merge_mask (XEXP (x, 2), mask, op);
> +      if (top0 || top1)

|| top2?

> +	return simplify_gen_ternary (GET_CODE (x), GET_MODE (x),
> +				     GET_MODE (XEXP (x, 0)),
> +				     top0 ? top0 : XEXP (x, 0),
> +				     top1 ? top1 : XEXP (x, 1),
> +				     top2 ? top2 : XEXP (x, 2));
> +    }
> +  return NULL_RTX;
> +}
> +
>  \f
>  /* Simplify CODE, an operation with result mode MODE and three operands,
>     OP0, OP1, and OP2.  OP0_MODE was the mode of OP0 before it became
> @@ -5967,6 +6029,16 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  	  && !side_effects_p (op2) && !side_effects_p (op1))
>  	return op0;
>  
> +      if (!side_effects_p (op2))
> +	{
> +	  rtx top0 = simplify_merge_mask (op0, op2, 0);
> +	  rtx top1 = simplify_merge_mask (op1, op2, 1);
> +	  if (top0 || top1)
> +	    return simplify_gen_ternary (code, mode, mode,
> +					 top0 ? top0 : op0,
> +					 top1 ? top1 : op1, op2);
> +	}
> +
>        break;
>  
>      default:
> @@ -6932,6 +7004,71 @@ test_vector_ops_series (machine_mode mode, rtx scalar_reg)
>  					    constm1_rtx));
>  }
>  
> +/* Verify simplify_merge_mask works correctly.  */
> +
> +static void
> +test_vec_merge (machine_mode mode)
> +{
> +  rtx op0 = make_test_reg (mode);
> +  rtx op1 = make_test_reg (mode);
> +  rtx op2 = make_test_reg (mode);
> +  rtx op3 = make_test_reg (mode);
> +  rtx op4 = make_test_reg (mode);
> +  rtx op5 = make_test_reg (mode);
> +  rtx mask1 = make_test_reg (SImode);
> +  rtx mask2 = make_test_reg (SImode);
> +  rtx vm1 = gen_rtx_VEC_MERGE (mode, op0, op1, mask1);
> +  rtx vm2 = gen_rtx_VEC_MERGE (mode, op2, op3, mask1);
> +  rtx vm3 = gen_rtx_VEC_MERGE (mode, op4, op5, mask1);
> +
> +  /* Simple vec_merge.  */
> +  ASSERT_EQ (op0, simplify_merge_mask (vm1, mask1, 0));
> +  ASSERT_EQ (op1, simplify_merge_mask (vm1, mask1, 1));
> +  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 0));
> +  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 1));
> +
> +  /* Nested vec_merge.  */
> +  rtx nvm = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
> +  ASSERT_EQ (vm1, simplify_merge_mask (nvm, mask1, 0));
> +  ASSERT_EQ (vm2, simplify_merge_mask (nvm, mask1, 1));

Think the last two should simplify to op0 and op3, which I guess
means recursing on the "return XEXP (x, op);"

> +  /* Intermediate unary op. */
> +  rtx unop = gen_rtx_NOT (mode, vm1);
> +  ASSERT_EQ (op0, XEXP (simplify_merge_mask (unop, mask1, 0), 0));
> +  ASSERT_EQ (op1, XEXP (simplify_merge_mask (unop, mask1, 1), 0));
> +
> +  /* Intermediate binary op. */
> +  rtx binop = gen_rtx_PLUS (mode, vm1, vm2);
> +  rtx res = simplify_merge_mask (binop, mask1, 0);
> +  ASSERT_EQ (op0, XEXP (res, 0));
> +  ASSERT_EQ (op2, XEXP (res, 1));
> +  res = simplify_merge_mask (binop, mask1, 1);
> +  ASSERT_EQ (op1, XEXP (res, 0));
> +  ASSERT_EQ (op3, XEXP (res, 1));
> +
> +  /* Intermediate ternary op. */
> +  rtx tenop = gen_rtx_FMA (mode, vm1, vm2, vm3);
> +  res = simplify_merge_mask (tenop, mask1, 0);
> +  ASSERT_EQ (op0, XEXP (res, 0));
> +  ASSERT_EQ (op2, XEXP (res, 1));
> +  ASSERT_EQ (op4, XEXP (res, 2));
> +  res = simplify_merge_mask (tenop, mask1, 1);
> +  ASSERT_EQ (op1, XEXP (res, 0));
> +  ASSERT_EQ (op3, XEXP (res, 1));
> +  ASSERT_EQ (op5, XEXP (res, 2));
> [...]
> +  /* Called indirectly.  */
> +  res = simplify_rtx (nvm);
> +  ASSERT_EQ (op0, XEXP (res, 0));
> +  ASSERT_EQ (op3, XEXP (res, 1));

Would probably be better to ASSERT_RTX_EQ against the full simplified rtx,
e.g. gen_rtx_NOT (mode, op0)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-26 16:50       ` Richard Sandiford
@ 2018-09-26 17:06         ` Andrew Stubbs
  2018-09-27  7:28           ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-26 17:06 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 26/09/18 17:48, Richard Sandiford wrote:
> Andrew Stubbs <ams@codesourcery.com> writes:
>> +  /* Nested vec_merge.  */
>> +  rtx nvm = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
>> +  ASSERT_EQ (vm1, simplify_merge_mask (nvm, mask1, 0));
>> +  ASSERT_EQ (vm2, simplify_merge_mask (nvm, mask1, 1));
> 
> Think the last two should simplify to op0 and op3, which I guess
> means recursing on the "return XEXP (x, op);"

I thought about doing that, but I noticed that, for example, 
simplify_gen_unary does not recurse into its operand. Is that an 
omission, or is it expected that those operands will already have been 
simplified?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-26 17:06         ` Andrew Stubbs
@ 2018-09-27  7:28           ` Richard Sandiford
  2018-09-27 14:13             ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-27  7:28 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 26/09/18 17:48, Richard Sandiford wrote:
>> Andrew Stubbs <ams@codesourcery.com> writes:
>>> +  /* Nested vec_merge.  */
>>> +  rtx nvm = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
>>> +  ASSERT_EQ (vm1, simplify_merge_mask (nvm, mask1, 0));
>>> +  ASSERT_EQ (vm2, simplify_merge_mask (nvm, mask1, 1));
>> 
>> Think the last two should simplify to op0 and op3, which I guess
>> means recursing on the "return XEXP (x, op);"
>
> I thought about doing that, but I noticed that, for example, 
> simplify_gen_unary does not recurse into its operand. Is that an 
> omission, or is it expected that those operands will already have been 
> simplified?

Ah, yeah, each operand should already be fully simplified.  But then the
only thing we're testing here compared to:

  /* Simple vec_merge.  */
  ASSERT_EQ (op0, simplify_merge_mask (vm1, mask1, 0));
  ASSERT_EQ (op1, simplify_merge_mask (vm1, mask1, 1));

is that we *don't* recurse.  It would be worth adding a comment
to say that, since if we both thought about it, I guess whoever
comes next will too.

And the assumption that existing VEC_MERGEs are fully simplified means
we should return null:

  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
    {
      if (!side_effects_p (XEXP (x, 1 - op)))
	return XEXP (x, op);
--->here
    }

On keeping the complexity down:

  if (side_effects_p (x))
    return NULL_RTX;

makes this quadratic for chains of unary operations.  Is it really
needed?  The code after it simply recurses on operands and doesn't
discard anything itself, so it looks like the VEC_MERGE call to
side_effects_p would be enough.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode
  2018-09-26 16:49       ` Richard Sandiford
@ 2018-09-27 12:20         ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-27 12:20 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 26/09/18 17:25, Richard Sandiford wrote:
> OK, thanks.

Committed, thanks.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: OpenCoarrays integration with gfortran
  2018-09-24 11:14                     ` Alastair McKinstry
@ 2018-09-27 12:51                       ` Richard Biener
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-09-27 12:51 UTC (permalink / raw)
  To: mckinstry
  Cc: Toon Moene, Jerry DeLisle, Damian Rouson, Thomas Koenig, Stubbs,
	Andrew, Janne Blomqvist, GCC Patches, fortran

On Mon, Sep 24, 2018 at 12:58 PM Alastair McKinstry
<mckinstry@debian.org> wrote:
>
>
> On 23/09/2018 10:46, Toon Moene wrote:
> > On 09/22/2018 01:23 AM, Jerry DeLisle wrote:
> >
> > I just installed opencoarrays on my system at home (Debian Testing):
> >
> > root@moene:~# apt-get install libcoarrays-openmpi-dev
> > ...
> > Setting up libcaf-openmpi-3:amd64 (2.2.0-3) ...
> > Setting up libcoarrays-openmpi-dev:amd64 (2.2.0-3) ...
> > Processing triggers for libc-bin (2.27-6) ...
> >
> > [ previously this led to apt errors, but not now. ]
> >
> > and moved my own installation of the OpenCoarrays-2.2.0.tar.gz out of
> > the way:
> >
> > toon@moene:~$ ls -ld *pen*
> > drwxr-xr-x 6 toon toon 4096 Aug 10 16:01 OpenCoarrays-2.2.0.opzij
> > drwxr-xr-x 8 toon toon 4096 Sep 15 11:26 opencoarrays-build.opzij
> > drwxr-xr-x 6 toon toon 4096 Sep 15 11:26 opencoarrays.opzij
> >
> > and recompiled my stuff:
> >
> > gfortran -g -fbacktrace -fcoarray=lib random-weather.f90
> > -L/usr/lib/x86_64-linux-gnu/open-coarrays/openmpi/lib -lcaf_mpi
> >
> > [ Yes, the location of the libs is quite experimental, but OK for the
> > "Testing" variant of Debian ... ]
> >
> > I couldn't find cafrun, but mpirun works just fine:
> >
> > toon@moene:~/src$ echo ' &config /' | mpirun --oversubscribe --bind-to
> > none -np 20 ./a.out
> >
> > ... etc. (see http://moene.org/~toon/random-weather.f90).
> >
> > I presume other Linux distributors will follow shortly (this *is*
> > Debian Testing, which can be a bit testy at times - but I do trust my
> > main business at home on it for over 15 years now).
> >
> > Kind regards,
> >
> Thanks, good to see it being tested (I'm the Debian/Ubuntu packager).
>
> caf /cafrun has been dropped (for the moment ? ) in favour of mpirun,
> but I've added pkg-config caf packages so that becomes an option.
>
>     $ pkg-config caf-mpich --libs
>
>     -L/usr/lib/x86_64-linux-gnu/open-coarrays/mpich/lib -lcaf_mpich -Wl,-z,relro -lmpich -lm -lbacktrace -lpthread -lrt
>
> (My thinking is that for libraries in particular, the user need not know
> whether CAF is being used, and if lib foobar uses CAF, then adding a:
>
>      Requires: caf
>
> into the pkg-config file gives you the correct linking transparently.
>
> The "strange" paths are due to Debians multiarch : it is possible to
> include libraries for multiple architectures simultaneously. This works
> ok with pkg-config and cmake , etc (which allow you to set
> PKG_CONFIG_PATH and have multiple pkgconfig files for different libs
> simultaneously) , but currently break wrappers such as caf / cafrun.
>
> I can add a new package for caf / cafrun but would rather not. (W e
> currently don't do non-MPI CAF builds).
>
> There is currently pkg-config files 'caf-mpich' and 'caf-openmpi' for
> testing, and I'm adding a default alias caf -> caf-$(default-MPI)

So I've tried packaging of OpenCoarrays for SUSE and noticed a few things:

 - caf by default links libcaf_mpi statically (why?)
 - the build system makes the libcaf_mpi SONAME dependent on the compiler
   version(?); I once got libcaf_mpi2 and once libcaf_mpi3 (gcc7 vs. gcc8)

Different SONAMEs definitely make packaging difficult.  Of course, given
the first point I may very well elide the shared library
altogether...?

Other than that it seems to "work" (OBS home:rguenther/OpenCoarrays).

Richard.

> regards
>
> Alastair
>
>
>
>
> --
> Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
> Misentropy: doubting that the Universe is becoming more disordered.
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-27  7:28           ` Richard Sandiford
@ 2018-09-27 14:13             ` Andrew Stubbs
  2018-09-27 16:28               ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-27 14:13 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 27/09/18 08:16, Richard Sandiford wrote:
> On keeping the complexity down:
> 
>    if (side_effects_p (x))
>      return NULL_RTX;
> 
> makes this quadratic for chains of unary operations.  Is it really
> needed?  The code after it simply recurses on operands and doesn't
> discard anything itself, so it looks like the VEC_MERGE call to
> side_effects_p would be enough.

The two calls do not check the same thing. The other one checks the 
other operand of a vec_merge, and this checks the current operand.

I suppose it's safe to discard a VEC_MERGE when the chosen operand 
contains side effects, but I'm not so sure when the VEC_MERGE itself is 
an operand to an operator with side effects. I'm having a hard time 
inventing a scenario in which a PRE_INC could contain a VEC_MERGE, but 
maybe a volatile MEM or ASM_OPERANDS could do?

Conversely, I don't see that side-effects deep down in an expression 
should stop us transforming it at a high level.

Is there an equivalent to side_effects_p that doesn't recurse? Should 
there be?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-27 14:13             ` Andrew Stubbs
@ 2018-09-27 16:28               ` Richard Sandiford
  2018-09-27 21:14                 ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-27 16:28 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 27/09/18 08:16, Richard Sandiford wrote:
>> On keeping the complexity down:
>> 
>>    if (side_effects_p (x))
>>      return NULL_RTX;
>> 
>> makes this quadratic for chains of unary operations.  Is it really
>> needed?  The code after it simply recurses on operands and doesn't
>> discard anything itself, so it looks like the VEC_MERGE call to
>> side_effects_p would be enough.
>
> The two calls do not check the same thing. The other one checks the 
> other operand of a vec_merge, and this checks the current operand.
>
> I suppose it's safe to discard a VEC_MERGE when the chosen operand 
> contains side effects, but I'm not so sure when the VEC_MERGE itself is 
> an operand to an operator with side effects. I'm having a hard time 
> inventing a scenario in which a PRE_INC could contain a VEC_MERGE, but 
> maybe a volatile MEM or ASM_OPERANDS could do?

But we wouldn't recurse for PRE_INC, MEM or ASM_OPERANDS, since they
have the wrong rtx class.  AFAICT no current unary, binary or ternary
operator has that level of side-effect (and that's a good thing).

We also don't guarantee to preserve FP exceptions as side-effects.

> Conversely, I don't see that side-effects deep down in an expression 
> should stop us transforming it as a high level.
>
> Is there an equivalent to side_effects_p that doesn't recurse? Should 
> there be?

Not aware of an existing function, and it might be useful to have
one at some point.  Just not sure we need it for this.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-27 16:28               ` Richard Sandiford
@ 2018-09-27 21:14                 ` Andrew Stubbs
  2018-09-28  8:42                   ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-27 21:14 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

On 27/09/18 17:19, Richard Sandiford wrote:
> But we wouldn't recurse for PRE_INC, MEM or ASM_OPERANDS, since they
> have the wrong rtx class.  AFAICT no current unary, binary or ternary
> operator has that level of side-effect (and that's a good thing).

OK, in that case I'll remove it and we can cross that bridge if we come 
to it.

This patch should also address your other concerns.
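
For concreteness, this is the kind of simplification it enables (an RTL
sketch with invented pseudos r0..r3 and mask r4):

  (vec_merge (plus (vec_merge r0 r1 r4)
                   (vec_merge r2 r3 r4))
             (mult r0 r1)
             r4)

now becomes

  (vec_merge (plus r0 r2) (mult r0 r1) r4)

since, in the lanes where the outer vec_merge selects its first operand,
the inner vec_merges with the same mask also select their first operands.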

OK?

Andrew

[-- Attachment #2: 180927-simplify-merge-mask.patch --]
[-- Type: text/x-patch, Size: 6623 bytes --]

Simplify vec_merge according to the mask.

This patch was part of the original patch we acquired from Honza and Martin.

It simplifies nested vec_merge operations using the same mask.

Self-tests are included.

2018-09-27  Andrew Stubbs  <ams@codesourcery.com>
	    Jan Hubicka  <jh@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>

	* simplify-rtx.c (simplify_merge_mask): New function.
	(simplify_ternary_operation): Use it, also see if VEC_MERGEs with the
	same masks are used in op1 or op2.
	(test_vec_merge): New function.
	(test_vector_ops): Call test_vec_merge.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index b4c6883..9bc5386 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5578,6 +5578,68 @@ simplify_cond_clz_ctz (rtx x, rtx_code cmp_code, rtx true_val, rtx false_val)
   return NULL_RTX;
 }
 
+/* Try to simplify X given that it appears within operand OP of a
+   VEC_MERGE operation whose mask is MASK.  X need not use the same
+   vector mode as the VEC_MERGE, but it must have the same number of
+   elements.
+
+   Return the simplified X on success, otherwise return NULL_RTX.  */
+
+rtx
+simplify_merge_mask (rtx x, rtx mask, int op)
+{
+  gcc_assert (VECTOR_MODE_P (GET_MODE (x)));
+  poly_uint64 nunits = GET_MODE_NUNITS (GET_MODE (x));
+  if (GET_CODE (x) == VEC_MERGE && rtx_equal_p (XEXP (x, 2), mask))
+    {
+      if (side_effects_p (XEXP (x, 1 - op)))
+	return NULL_RTX;
+
+      return XEXP (x, op);
+    }
+  if (UNARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      if (top0)
+	return simplify_gen_unary (GET_CODE (x), GET_MODE (x), top0,
+				   GET_MODE (XEXP (x, 0)));
+    }
+  if (BINARY_P (x)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      if (top0 || top1)
+	return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
+				    top0 ? top0 : XEXP (x, 0),
+				    top1 ? top1 : XEXP (x, 1));
+    }
+  if (GET_RTX_CLASS (GET_CODE (x)) == RTX_TERNARY
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 0))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 1)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 1))), nunits)
+      && VECTOR_MODE_P (GET_MODE (XEXP (x, 2)))
+      && known_eq (GET_MODE_NUNITS (GET_MODE (XEXP (x, 2))), nunits))
+    {
+      rtx top0 = simplify_merge_mask (XEXP (x, 0), mask, op);
+      rtx top1 = simplify_merge_mask (XEXP (x, 1), mask, op);
+      rtx top2 = simplify_merge_mask (XEXP (x, 2), mask, op);
+      if (top0 || top1 || top2)
+	return simplify_gen_ternary (GET_CODE (x), GET_MODE (x),
+				     GET_MODE (XEXP (x, 0)),
+				     top0 ? top0 : XEXP (x, 0),
+				     top1 ? top1 : XEXP (x, 1),
+				     top2 ? top2 : XEXP (x, 2));
+    }
+  return NULL_RTX;
+}
+
 \f
 /* Simplify CODE, an operation with result mode MODE and three operands,
    OP0, OP1, and OP2.  OP0_MODE was the mode of OP0 before it became
@@ -5967,6 +6029,16 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	  && !side_effects_p (op2) && !side_effects_p (op1))
 	return op0;
 
+      if (!side_effects_p (op2))
+	{
+	  rtx top0 = simplify_merge_mask (op0, op2, 0);
+	  rtx top1 = simplify_merge_mask (op1, op2, 1);
+	  if (top0 || top1)
+	    return simplify_gen_ternary (code, mode, mode,
+					 top0 ? top0 : op0,
+					 top1 ? top1 : op1, op2);
+	}
+
       break;
 
     default:
@@ -6856,6 +6928,69 @@ test_vector_ops_series (machine_mode mode, rtx scalar_reg)
 					    constm1_rtx));
 }
 
+/* Verify simplify_merge_mask works correctly.  */
+
+static void
+test_vec_merge (machine_mode mode)
+{
+  rtx op0 = make_test_reg (mode);
+  rtx op1 = make_test_reg (mode);
+  rtx op2 = make_test_reg (mode);
+  rtx op3 = make_test_reg (mode);
+  rtx op4 = make_test_reg (mode);
+  rtx op5 = make_test_reg (mode);
+  rtx mask1 = make_test_reg (SImode);
+  rtx mask2 = make_test_reg (SImode);
+  rtx vm1 = gen_rtx_VEC_MERGE (mode, op0, op1, mask1);
+  rtx vm2 = gen_rtx_VEC_MERGE (mode, op2, op3, mask1);
+  rtx vm3 = gen_rtx_VEC_MERGE (mode, op4, op5, mask1);
+
+  /* Simple vec_merge.  */
+  ASSERT_EQ (op0, simplify_merge_mask (vm1, mask1, 0));
+  ASSERT_EQ (op1, simplify_merge_mask (vm1, mask1, 1));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 0));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (vm1, mask2, 1));
+
+  /* Nested vec_merge.
+     It's tempting to make this simplify right down to opN, but we don't
+     because all the simplify_* functions assume that the operands have
+     already been simplified.  */
+  rtx nvm = gen_rtx_VEC_MERGE (mode, vm1, vm2, mask1);
+  ASSERT_EQ (vm1, simplify_merge_mask (nvm, mask1, 0));
+  ASSERT_EQ (vm2, simplify_merge_mask (nvm, mask1, 1));
+
+  /* Intermediate unary op. */
+  rtx unop = gen_rtx_NOT (mode, vm1);
+  ASSERT_RTX_EQ (gen_rtx_NOT (mode, op0),
+		 simplify_merge_mask (unop, mask1, 0));
+  ASSERT_RTX_EQ (gen_rtx_NOT (mode, op1),
+		 simplify_merge_mask (unop, mask1, 1));
+
+  /* Intermediate binary op. */
+  rtx binop = gen_rtx_PLUS (mode, vm1, vm2);
+  ASSERT_RTX_EQ (gen_rtx_PLUS (mode, op0, op2), 
+		 simplify_merge_mask (binop, mask1, 0));
+  ASSERT_RTX_EQ (gen_rtx_PLUS (mode, op1, op3),
+		 simplify_merge_mask (binop, mask1, 1));
+
+  /* Intermediate ternary op. */
+  rtx tenop = gen_rtx_FMA (mode, vm1, vm2, vm3);
+  ASSERT_RTX_EQ (gen_rtx_FMA (mode, op0, op2, op4),
+		 simplify_merge_mask (tenop, mask1, 0));
+  ASSERT_RTX_EQ (gen_rtx_FMA (mode, op1, op3, op5),
+		 simplify_merge_mask (tenop, mask1, 1));
+
+  /* Side effects.  */
+  rtx badop0 = gen_rtx_PRE_INC (mode, op0);
+  rtx badvm = gen_rtx_VEC_MERGE (mode, badop0, op1, mask1);
+  ASSERT_EQ (badop0, simplify_merge_mask (badvm, mask1, 0));
+  ASSERT_EQ (NULL_RTX, simplify_merge_mask (badvm, mask1, 1));
+
+  /* Called indirectly.  */
+  ASSERT_RTX_EQ (gen_rtx_VEC_MERGE (mode, op0, op3, mask1),
+		 simplify_rtx (nvm));
+}
+
 /* Verify some simplifications involving vectors.  */
 
 static void
@@ -6871,6 +7006,7 @@ test_vector_ops ()
 	  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
 	      && maybe_gt (GET_MODE_NUNITS (mode), 2))
 	    test_vector_ops_series (mode, scalar_reg);
+	  test_vec_merge (mode);
 	}
     }
 }

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 15/25] Don't double-count early-clobber matches.
  2018-09-17  9:22   ` Richard Sandiford
@ 2018-09-27 22:54     ` Andrew Stubbs
  2018-10-04 22:43       ` Richard Sandiford
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-27 22:54 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 421 bytes --]

On 17/09/18 10:18, Richard Sandiford wrote:
> The idea looks good to me FWIW, but you can't use curr_static_id for
> the state, since that's a static description of the .md pattern rather
> than data about this particular instance.

I clearly misunderstood what that was for.

This patch does the same thing, but uses a local variable to store the 
state. That probably means it does it more correctly, too.

OK?

Andrew

[-- Attachment #2: 180927-early-clobber-reject.patch --]
[-- Type: text/x-patch, Size: 3362 bytes --]

Don't double-count early-clobber matches.

Given a pattern with a number of operands:

(match_operand 0 "" "=&v")
(match_operand 1 "" " v0")
(match_operand 2 "" " v0")
(match_operand 3 "" " v0")

GCC will currently increment "reject" once, for operand 0, and then decrement
it once for each of the other operands, ending with reject == -2 and an
assertion failure.  If there's a conflict then it might try to decrement reject
yet again.

Incidentally, what these patterns are trying to achieve is an allocation in
which operand 0 may match one of the other operands, but may not partially
overlap any of them.  Ideally there'd be a better way to do this.

In any case, it will affect any pattern in which multiple operands may (or
must) match an early-clobber operand.

The patch only allows a reject-- when one has not already occurred, for that
operand.

2018-09-27  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* lra-constraints.c (process_alt_operands): Check
	matching_early_clobber before decrementing reject, and set
	matching_early_clobber after.
	* lra-int.h (struct lra_operand_data): Add matching_early_clobber.
	* lra.c (setup_operand_alternative): Initialize matching_early_clobber.

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 774d1ff..e1d1688 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -1969,6 +1969,7 @@ process_alt_operands (int only_alternative)
       if (!TEST_BIT (preferred, nalt))
 	continue;
 
+      bool matching_early_clobber[MAX_RECOG_OPERANDS] = {};
       curr_small_class_check++;
       overall = losers = addr_losers = 0;
       static_reject = reject = reload_nregs = reload_sum = 0;
@@ -2175,7 +2176,11 @@ process_alt_operands (int only_alternative)
 				 "            %d Matching earlyclobber alt:"
 				 " reject--\n",
 				 nop);
-			    reject--;
+			    if (!matching_early_clobber[m])
+			      {
+				reject--;
+				matching_early_clobber[m] = 1;
+			      }
 			  }
 			/* Otherwise we prefer no matching
 			   alternatives because it gives more freedom
@@ -2921,15 +2926,11 @@ process_alt_operands (int only_alternative)
 	      curr_alt_dont_inherit_ops[curr_alt_dont_inherit_ops_num++]
 		= last_conflict_j;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Conflict early clobber reload: reject--\n",
 		   i);
-	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
 	    }
 	  else
 	    {
@@ -2953,17 +2954,21 @@ process_alt_operands (int only_alternative)
 		}
 	      curr_alt_win[i] = curr_alt_match_win[i] = false;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Matched conflict early clobber reloads: "
 		   "reject--\n",
 		   i);
+	    }
+	  /* Early clobber was already reflected in REJECT. */
+	  if (!matching_early_clobber[i])
+	    {
+	      lra_assert (reject > 0);
 	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
+	      matching_early_clobber[i] = 1;
 	    }
+	  overall += LRA_LOSER_COST_FACTOR - 1;
 	}
       if (lra_dump_file != NULL)
 	fprintf (lra_dump_file, "          alt=%d,overall=%d,losers=%d,rld_nregs=%d\n",

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-27 21:14                 ` Andrew Stubbs
@ 2018-09-28  8:42                   ` Richard Sandiford
  2018-09-28 13:50                     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-09-28  8:42 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 27/09/18 17:19, Richard Sandiford wrote:
>> But we wouldn't recurse for PRE_INC, MEM or ASM_OPERANDS, since they
>> have the wrong rtx class.  AFAICT no current unary, binary or ternary
>> operator has that level of side-effect (and that's a good thing).
>
> OK, in that case I'll remove it and we can cross that bridge if we come 
> to it.
>
> This patch should also address your other concerns.
>
> OK?

Yes, thanks.

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-19 13:46           ` Richard Biener
@ 2018-09-28 12:48             ` Andrew Stubbs
  2018-10-01  8:05               ` Richard Biener
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-28 12:48 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Richard Sandiford

On 19/09/18 14:45, Richard Biener wrote:
> So I guess the current_vector_size thing isn't too hard to get rid of, what
> you'd end up with would be using that size when you decide for vector
> types for loads (where there are no USEs with vector types, so for example
> this would not apply to gathers).

I've finally got back to looking at this ...

My patch works because current_vector_size is only referenced in two 
places. One is passed to get_vectype_for_scalar_type_and_size, and that 
function simply calls targetm.vectorize.preferred_simd_mode when the 
requested size is zero. The other is passed to build_truth_vector_type, 
which only uses it to call targetm.vectorize.get_mask_mode, and the GCN 
backend ignores the size parameter because it only has one option. 
Presumably other backends would object to a zero size mask.

So, as I said originally, the effect is that leaving current_vector_size 
zeroed means "always ask the backend".

Pretty much everything else chains off of those places using 
get_same_sized_vectype, so ignoring current_vector_size is safe on GCN, 
and might even be safe on other architectures?

> So I'd say you want to refactor get_same_sized_vectype uses and
> make the size argument to get_vectype_for_scalar_type_and_size
> a hint only.

I've looked through the uses of get_same_sized_vectype and I've come to 
the conclusion that many of them really mean it.

For example, vectorizable_bswap tries to reinterpret a vector register 
as a byte vector so that it can permute it. This is an optimization that 
won't work on GCN (because the vector registers don't work like that), 
but seems like a valid use of the vector size characteristic of other 
architectures.
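
(Concretely: a 128-bit V4SI register can be viewed as V16QI and
byte-permuted to swap each element's endianness; a GCN V64SI register
has no equivalent byte view.)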

For another example, vectorizable_conversion is targeting the 
vec_pack_trunc patterns, and therefore really does want to specify the 
types. Again, this isn't something we want to do on GCN (a regular trunc 
pattern with a vector mode will work fine).

However, vectorizable_operation seems to use it to try to match the 
input and output types to the same vector unit (i.e. vector size); at 
least that's my interpretation. It returns "not vectorizable" if the 
input and output vectors have different numbers of elements. For most 
operators the lhs and rhs types will be the same, so we're all good, but 
I imagine that this code will prevent TRUNC being vectorized on GCN 
because the "same size" vector doesn't exist, and it doesn't check if 
there's a vector with the same number of elements (I've not actually 
tried that, yet, and there may be extra magic elsewhere for that case, 
but YSWIM).

I don't think changing this case to a new "get_same_length_vectype" 
would be appropriate for many architectures, so I'm not sure what to do 
here?

We could fix this with new target hooks, perhaps?

TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)

   Returns a new vectype (or mode) that uses the same vector register as
   vectype_in, but has elements of scalartype_out.

   The default implementation would be get_same_sized_vectype.

   GCN would just return NULL, because you can't do that kind of
   optimization.

TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)

   Returns a new vectype (or mode) that has the right number of elements
   for the opcode (i.e. the same number, or 2x for packed opcodes), and
   elements of scalartype_out.  The backend might choose a different
   vector size, but promises that hardware can do the operation (i.e.
   it's not mixing vector units).

   The default implementation would be get_same_sized_vectype, for
   backward compatibility.

   GCN would simply return V64xx according to scalartype_out, and NULL
   for unsupported opcodes.
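
A GCN implementation of the second hook might then be as simple as this
(hypothetical, of course, since the hook doesn't exist; the name,
signature and supported_p helper are illustrative only):

  static tree
  gcn_vectorize_compatible_vector (enum tree_code code, tree vectype_in,
				   tree scalartype_out)
  {
    /* GCN vector registers always have 64 lanes, so the only candidate
       is the 64-element vector of the output type.  */
    if (!supported_p (code, vectype_in))	/* hypothetical check */
      return NULL_TREE;
    return build_vector_type (scalartype_out, 64);
  }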

Of course, none of this addresses the question of which vector size to 
choose in the first place. I've not figured out how it might ever start 
with a type other than the "preferred SIMD mode", yet.

Thoughts?

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-28  8:42                   ` Richard Sandiford
@ 2018-09-28 13:50                     ` Andrew Stubbs
  2019-02-22  3:40                       ` H.J. Lu
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-28 13:50 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 28/09/18 09:11, Richard Sandiford wrote:
> Yes, thanks.

Committed.

Thanks for all the reviews. :-)

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 12/25] Make default_static_chain return NULL in non-static functions
  2018-09-17 18:55   ` Richard Sandiford
@ 2018-09-28 14:23     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-09-28 14:23 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 17/09/18 19:55, Richard Sandiford wrote:
> Which part of the backend needs this?  I couldn't tell from a quick
> grep where the call came from.

It wasn't called directly, but from builtins.c and df-scan.c.

I needed this for GCC7, but apparently in newer source-bases the problem 
has been fixed another way.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83423

I'll drop this patch.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE
  2018-09-28 12:48             ` Andrew Stubbs
@ 2018-10-01  8:05               ` Richard Biener
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Biener @ 2018-10-01  8:05 UTC (permalink / raw)
  To: Stubbs, Andrew; +Cc: GCC Patches, Richard Sandiford

On Fri, Sep 28, 2018 at 2:47 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 19/09/18 14:45, Richard Biener wrote:
> > So I guess the current_vector_size thing isn't too hard to get rid of, what
> > you'd end up with would be using that size when you decide for vector
> > types for loads (where there are no USEs with vector types, so for example
> > this would not apply to gathers).
>
> I've finally got back to looking at this ...
>
> My patch works because current_vector_size is only referenced in two
> places. One is passed to get_vectype_for_scalar_type_and_size, and that
> function simply calls targetm.vectorize.preferred_simd_mode when the
> requested size is zero. The other is passed to build_truth_vector_type,
> which only uses it to call targetm.vectorize.get_mask_mode, and the GCN
> backend ignores the size parameter because it only has one option.
> Presumably other backends would object to a zero size mask.
>
> So, as I said originally, the effect is that leaving current_vector_size
> zeroed means "always ask the backend".

Yes.

> Pretty much everything else chains off of those places using
> get_same_sized_vectype, so ignoring current_vector_size is safe on GCN,
> and might even be safe on other architectures?

Other architectures really only use it when there's a choice, like
choosing between V4SI, V8SI and V16SI on x86_64.  current_vector_size
was introduced to be able to "iterate" over supported ISAs and let the
vectorizer decide which one to use in the end (SSE vs. AVX vs. AVX512).

The value of zero is simply to give the target another chance to set
its preferred value based on the first call.  I'd call that a bit
awkward (*).

For architectures that only have a single "vector size" this variable
is really spurious and whether it is zero or non-zero doesn't make a difference.
Apart from your architecture of course where non-zero doesn't work ;)

(*) So one possibility would be to forgo the special value of zero
("auto-detect") and thus not change current_vector_size in
get_vectype_for_scalar_type at all.  For targets which report support
for multiple vector sizes, set current_vector_size to the preferred one
in the loop over vector sizes; for targets that do not, simply keep it
at zero.

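Roughly, and untested (next_size names whatever iteration variable the
loop over sizes ends up using):

  auto_vector_sizes sizes;
  targetm.vectorize.autovectorize_vector_sizes (&sizes);
  if (sizes.length () > 1)
    /* Several sizes supported: nail down the one being tried.  */
    current_vector_size = sizes[next_size];
  /* Otherwise leave current_vector_size at zero ("ask the target").  */
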
> > So I'd say you want to refactor get_same_sized_vectype uses and
> > make the size argument to get_vectype_for_scalar_type_and_size
> > a hint only.
>
> I've looked through the uses of get_same_sized_vectype and I've come to
> the conclusion that many of them really mean it.
>
> For example, vectorizable_bswap tries to reinterpret a vector register
> as a byte vector so that it can permute it. This is an optimization that
> won't work on GCN (because the vector registers don't work like that),
> but seems like a valid use of the vector size characteristic of other
> architectures.

True.

> For another example, vectorizable_conversion is targeting the
> vec_pack_trunc patterns, and therefore really does want to specify the
> types. Again, this isn't something we want to do on GCN (a regular trunc
> pattern with a vector mode will work fine).
>
> However, vectorizable_operation seems to use it to try to match the
> input and output types to the same vector unit (i.e. vector size); at
> least that's my interpretation. It returns "not vectorizable" if the
> input and output vectors have different numbers of elements. For most
> operators the lhs and rhs types will be the same, so we're all good, but
> I imagine that this code will prevent TRUNC being vectorized on GCN
> because the "same size" vector doesn't exist, and it doesn't check if
> there's a vector with the same number of elements (I've not actually
> tried that, yet, and there may be extra magic elsewhere for that case,
> but YSWIM).

Yeah, we don't have a get_vector_type_for_scalar_type_and_nelems
which would probably be semantically better in many places.

> I don't think changing this case to a new "get_same_length_vectype"
> would be appropriate for many architectures, so I'm not sure what to do
> here?
>
> We could fix this with new target hooks, perhaps?
>
> TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)
>
>    Returns a new vectype (or mode) that uses the same vector register as
>    vectype_in, but has elements of scalartype_out.
>
>    The default implementation would be get_same_sized_vectype.
>
>    GCN would just return NULL, because you can't do that kind of
>    optimization.
>
> TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)
>
>    Returns a new vectype (or mode) that has the right number of elements
>    for the opcode (i.e. the same number, or 2x for packed opcodes), and
>    elements of scalartype_out.  The backend might choose a different
>    vector size, but promises that hardware can do the operation (i.e.
>    it's not mixing vector units).
>
>    The default implementation would be get_same_sized_vectype, for
>    backward compatibility.
>
>    GCN would simply return V64xx according to scalartype_out, and NULL
>    for unsupported opcodes.

I don't like putting too much of the burden on the target here, given
that the vectorizer should know what kind of constraints it has: it
implements the vectorization on GIMPLE, which has IL constraints that
are to be met - we just need to ask for vector types with the
appropriate constraints rather than using same-size everywhere.

> Of course, none of this addresses the question of which vector size to
> choose in the first place.

See above for a suggestion.

> I've not figured out how it might ever start
> with a type other than the "preferred SIMD mode", yet.

In practically all cases vect_analyze_data_refs calling
get_vectype_for_scalar_type
on a load will be the one nailing down current_vector_size (if zero).
I also cannot
quickly think of a case where that would differ from "preferred SIMD
mode" unless
the target simply lies to us here ;)

So, would a current_vector_size re-org like outlined above help you?  I agree
leaving it at zero should work unless there's code in the vectorizer
that is simply
wrong.  Addressing some GCN issues with get_vectype_for_scalar_type_and_nunits
would also OK with me (if that works).

Thanks,
Richard.

> Thoughts?
>
> Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-09-20 11:42       ` Andrew Stubbs
  2018-09-26 16:23         ` Andrew Stubbs
@ 2018-10-04 18:24         ` Jeff Law
  2018-10-11 14:28           ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-10-04 18:24 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/20/18 4:52 AM, Andrew Stubbs wrote:
> On 19/09/18 17:38, Andrew Stubbs wrote:
>> Here's an updated patch incorporating the RTL front-end changes. I had
>> to change from "repeated 2x" to "repeated x2" because the former is
>> not a valid C token, and apparently that's important.
> 
> Here's a patch with self tests added, for both reading and writing.
> 
> It also fixes a bug when the repeat was the last item in a list.
> 
> OK?
> 
> Andrew
> 
> 180920-elide-repeated-RTL-elements.patch
> 
> Elide repeated RTL elements.
> 
> GCN's 64-lane vectors tend to make RTL dumps very long.  This patch makes them
> far more bearable by eliding long sequences of the same element into "repeated"
> messages.
> 
> This also takes care of reading repeated sequences in the RTL front-end.
> 
> There are self tests for both reading and writing.
> 
> 2018-09-20  Andrew Stubbs  <ams@codesourcery.com>
> 	    Jan Hubicka  <jh@suse.cz>
> 	    Martin Jambor  <mjambor@suse.cz>
> 
> 	gcc/
> 	* print-rtl.c (print_rtx_operand_codes_E_and_V): Print how many times
> 	the same elements are repeated rather than printing all of them.
> 	* read-rtl.c (rtx_reader::read_rtx_operand): Recognize and expand
> 	"repeated" elements.
> 	* read-rtl-function.c (test_loading_repeat): New function.
> 	(read_rtl_function_c_tests): Call test_loading_repeat.
> 	* rtl-tests.c (test_dumping_repeat): New function.
> 	(rtl_tests_c_tests): Call test_dumping_repeat.
> 
> 	gcc/testsuite/
> 	* selftests/repeat.rtl: New file.
OK.  Thanks for fixing the reader and adding selftests.

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 04/25] SPECIAL_REGNO_P
  2018-09-13 10:03       ` Andrew Stubbs
  2018-09-13 14:14         ` Andrew Stubbs
@ 2018-10-04 19:13         ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Jeff Law @ 2018-10-04 19:13 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/13/18 4:01 AM, Andrew Stubbs wrote:
> 
> The register that find_rename_reg is considering is SCC, which is one of
> the "special" registers.  There is a short-cut in rename_chains for
> fixed registers, global registers, and frame pointers.  It does not
> check HARD_REGNO_RENAME_OK.
I wonder if it expects the caller to have avoided putting these
registers in the chain.


> 
> The assert is caused because the def-use chains indicate that SCC
> conflicts with itself. I suppose the question is why is it doing that,
> but it's probably do do with that being a special register that gets
> used in split2 (particularly by the addptrdi3 pattern). Although, those
> patterns are careful to save SCC to one side and then restore it again
> after, so I'd have thought the DF analysis would work out?
If you have SCC before its first set, then DF is going to think the SCC
register is live at function entry.

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 15/25] Don't double-count early-clobber matches.
  2018-09-27 22:54     ` Andrew Stubbs
@ 2018-10-04 22:43       ` Richard Sandiford
  2018-10-22 15:36         ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Sandiford @ 2018-10-04 22:43 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: gcc-patches

Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 10:18, Richard Sandiford wrote:
>> The idea looks good to me FWIW, but you can't use curr_static_id for
>> the state, since that's a static description of the .md pattern rather
>> than data about this particular instance.
>
> I clearly misunderstood what that was for.
>
> This patch does the same thing, but uses a local variable to store the 
> state. That probably means it does it more correctly, too.
>
> OK?
>
> Andrew
>
> Don't double-count early-clobber matches.
>
> Given a pattern with a number of operands:
>
> (match_operand 0 "" "=&v")
> (match_operand 1 "" " v0")
> (match_operand 2 "" " v0")
> (match_operand 3 "" " v0")
>
> GCC will currently increment "reject" once, for operand 0, and then decrement
> it once for each of the other operands, ending with reject == -2 and an
> assertion failure.  If there's a conflict then it might try to decrement reject
> yet again.
>
> Incidentally, what these patterns are trying to achieve is an allocation in
> which operand 0 may match one of the other operands, but may not partially
> overlap any of them.  Ideally there'd be a better way to do this.
>
> In any case, it will affect any pattern in which multiple operands may (or
> must) match an early-clobber operand.
>
> The patch only allows a reject-- when one has not already occurred, for that
> operand.
>
> 2018-09-27  Andrew Stubbs  <ams@codesourcery.com>
>
> 	gcc/
> 	* lra-constraints.c (process_alt_operands): Check
> 	matching_early_clobber before decrementing reject, and set
> 	matching_early_clobber after.
> 	* lra-int.h (struct lra_operand_data): Add matching_early_clobber.
> 	* lra.c (setup_operand_alternative): Initialize matching_early_clobber.
>
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index 774d1ff..e1d1688 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -1969,6 +1969,7 @@ process_alt_operands (int only_alternative)
>        if (!TEST_BIT (preferred, nalt))
>  	continue;
>  
> +      bool matching_early_clobber[MAX_RECOG_OPERANDS] = {};

This is potentially expensive, since MAX_RECOG_OPERANDS >= 30 and
most instructions have operand counts in the low single digits.
(And this is a very compile-time sensitive function -- it often
shows up at the top or near the top of a "cc1 -O0" profile.)

How about clearing it in this loop:

      curr_small_class_check++;
      overall = losers = addr_losers = 0;
      static_reject = reject = reload_nregs = reload_sum = 0;
      for (nop = 0; nop < n_operands; nop++)
	{
	  ...
	}

OK with that change if it works, thanks.

Sorry for the slow reply...

Richard

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 09/25] Elide repeated RTL elements.
  2018-10-04 18:24         ` Jeff Law
@ 2018-10-11 14:28           ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-10-11 14:28 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 04/10/18 19:12, Jeff Law wrote:
> OK.  Thanks for fixing the reader and adding selftests.

Thanks, committed.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 15/25] Don't double-count early-clobber matches.
  2018-10-04 22:43       ` Richard Sandiford
@ 2018-10-22 15:36         ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-10-22 15:36 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 134 bytes --]

On 04/10/2018 21:39, Richard Sandiford wrote:
> OK with that change if it works, thanks.

Thanks, here's what I've committed.

Andrew

[-- Attachment #2: 181022-early-clobber-matches.patch --]
[-- Type: text/x-patch, Size: 3553 bytes --]

Don't double-count early-clobber matches.

Given a pattern with a number of operands:

(match_operand 0 "" "=&v")
(match_operand 1 "" " v0")
(match_operand 2 "" " v0")
(match_operand 3 "" " v0")

GCC will currently increment "reject" once, for operand 0, and then decrement
it once for each of the other operands, ending with reject == -2 and an
assertion failure.  If there's a conflict then it might try to decrement reject
yet again.

Incidentally, what these patterns are trying to achieve is an allocation in
which operand 0 may match one of the other operands, but may not partially
overlap any of them.  Ideally there'd be a better way to do this.

In any case, it will affect any pattern in which multiple operands may (or
must) match an early-clobber operand.

The patch only allows a reject-- when one has not already occurred, for that
operand.

2018-10-22  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* lra-constraints.c (process_alt_operands): New local array,
	matching_early_clobber.  Check matching_early_clobber before
	decrementing reject, and set matching_early_clobber after.

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 774d1ff..3b355a8 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -1969,6 +1969,7 @@ process_alt_operands (int only_alternative)
       if (!TEST_BIT (preferred, nalt))
 	continue;
 
+      bool matching_early_clobber[MAX_RECOG_OPERANDS];
       curr_small_class_check++;
       overall = losers = addr_losers = 0;
       static_reject = reject = reload_nregs = reload_sum = 0;
@@ -1980,6 +1981,7 @@ process_alt_operands (int only_alternative)
 	    fprintf (lra_dump_file,
 		     "            Staticly defined alt reject+=%d\n", inc);
 	  static_reject += inc;
+	  matching_early_clobber[nop] = 0;
 	}
       reject += static_reject;
       early_clobbered_regs_num = 0;
@@ -2175,7 +2177,11 @@ process_alt_operands (int only_alternative)
 				 "            %d Matching earlyclobber alt:"
 				 " reject--\n",
 				 nop);
-			    reject--;
+			    if (!matching_early_clobber[m])
+			      {
+				reject--;
+				matching_early_clobber[m] = 1;
+			      }
 			  }
 			/* Otherwise we prefer no matching
 			   alternatives because it gives more freedom
@@ -2921,15 +2927,11 @@ process_alt_operands (int only_alternative)
 	      curr_alt_dont_inherit_ops[curr_alt_dont_inherit_ops_num++]
 		= last_conflict_j;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Conflict early clobber reload: reject--\n",
 		   i);
-	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
 	    }
 	  else
 	    {
@@ -2953,17 +2955,21 @@ process_alt_operands (int only_alternative)
 		}
 	      curr_alt_win[i] = curr_alt_match_win[i] = false;
 	      losers++;
-	      /* Early clobber was already reflected in REJECT. */
-	      lra_assert (reject > 0);
 	      if (lra_dump_file != NULL)
 		fprintf
 		  (lra_dump_file,
 		   "            %d Matched conflict early clobber reloads: "
 		   "reject--\n",
 		   i);
+	    }
+	  /* Early clobber was already reflected in REJECT. */
+	  if (!matching_early_clobber[i])
+	    {
+	      lra_assert (reject > 0);
 	      reject--;
-	      overall += LRA_LOSER_COST_FACTOR - 1;
+	      matching_early_clobber[i] = 1;
 	    }
+	  overall += LRA_LOSER_COST_FACTOR - 1;
 	}
       if (lra_dump_file != NULL)
 	fprintf (lra_dump_file, "          alt=%d,overall=%d,losers=%d,rld_nregs=%d\n",

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 20/25] GCN libgcc.
  2018-09-05 11:52 ` [PATCH 20/25] GCN libgcc ams
  2018-09-05 12:32   ` Joseph Myers
@ 2018-11-09 18:49   ` Jeff Law
  2018-11-12 12:01     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-11-09 18:49 UTC (permalink / raw)
  To: ams, gcc-patches

On 9/5/18 5:52 AM, ams@codesourcery.com wrote:
> This patch contains the GCN port of libgcc.  I've broken it out just to keep
> both parts more manageable.
> 
> We have the usual stuff, plus a "gomp_print" implementation intended to provide
> a means to output text to console without using the full printf.  Originally
> this was because we did not have a working Newlib port, but now it provides the
> underlying mechanism for printf.  It's also much lighter than printf, and
> therefore more suitable for debugging offload kernels (for which there is no
> debugger, yet).
> 
> In order to work in offload kernels the same function must be present in both
> host and GCN toolchains.  Therefore it needs to live in libgomp (hence the
> name).  However, having found it also useful in stand alone testing I have
> moved the GCN implementation to libgcc.
> 
> It was also necessary to provide a means to disable EMUTLS.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
> 	    Kwok Cheung Yeung  <kcy@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 	    Tom de Vries  <tom@codesourcery.com>
> 
> 	libgcc/
> 	* Makefile.in: Don't add emutls.c when --enable-emutls is "no".
> 	* config.host: Recognize amdgcn*-*-amdhsa.
> 	* config/gcn/crt0.c: New file.
> 	* config/gcn/gomp_print.c: New file.
> 	* config/gcn/lib2-divmod-hi.c: New file.
> 	* config/gcn/lib2-divmod.c: New file.
> 	* config/gcn/lib2-gcn.h: New file.
> 	* config/gcn/reduction.c: New file.
> 	* config/gcn/sfp-machine.h: New file.
> 	* config/gcn/t-amdgcn: New file.
> ---
> 
> 
> 0020-GCN-libgcc.patch
> 
> diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
> index 0c5b264..6f68257 100644
> --- a/libgcc/Makefile.in
> +++ b/libgcc/Makefile.in
> @@ -429,9 +429,11 @@ LIB2ADD += enable-execute-stack.c
>  # While emutls.c has nothing to do with EH, it is in LIB2ADDEH*
>  # instead of LIB2ADD because that's the way to be sure on some targets
>  # (e.g. *-*-darwin*) only one copy of it is linked.
> +ifneq ($(enable_emutls),no)
>  LIB2ADDEH += $(srcdir)/emutls.c
>  LIB2ADDEHSTATIC += $(srcdir)/emutls.c
>  LIB2ADDEHSHARED += $(srcdir)/emutls.c
> +endif
Why is this needed? Are you just trying to cut out stuff you don't need
in the quest for smaller code or does this cause a more direct problem?


> diff --git a/libgcc/config/gcn/crt0.c b/libgcc/config/gcn/crt0.c
> new file mode 100644
> index 0000000..f4f367b
> --- /dev/null
> +++ b/libgcc/config/gcn/crt0.c
> @@ -0,0 +1,23 @@
> +/* Copyright (C) 2017 Free Software Foundation, Inc.
> +
> +   This file is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by the
> +   Free Software Foundation; either version 3, or (at your option) any
> +   later version.
> +
> +   This file is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Provide an entry point symbol to silence a linker warning.  */
> +void _start() {}
This seems wrong.   I realize you're trying to quiet a linker warning
here, but for the case where you're creating GCN executables (testing?)
this should probably be handled by the C-runtime or linker script.


> diff --git a/libgcc/config/gcn/gomp_print.c b/libgcc/config/gcn/gomp_print.c
> new file mode 100644
> index 0000000..41f50c3
> --- /dev/null
> +++ b/libgcc/config/gcn/gomp_print.c
[ ... ]
Would this be better in libgomp?  Oh, you addressed that in the
prologue.  Feels like libgomp would be better to me, but I can
understand the rationale behind wanting it in libgcc.


I won't comment on the static sizes since this apparently has to match
something already in existence.



> +
> +void
> +gomp_print_string (const char *msg, const char *value)
> +{
> +  struct printf_data *output = reserve_print_slot ();
> +  output->type = 2; /* String.  */
> +
> +  strncpy (output->msg, msg, 127);
> +  output->msg[127] = '\0';
> +  strncpy (output->text, value, 127);
> +  output->text[127] = '\0';
> +
> +  asm ("" ::: "memory");
> +  output->written = 1;
> +}
I'm not familiar with the GCN memory model, but your asm is really just
a barrier for the compiler.  Do you need any kind of hardware fencing
here?  Similarly for other instances.

All these functions probably need a little comment on their purpose and
arguments.

Note some of the divmod stuff recently changed.  You may need minor
updates as a result of those patches.  See:

2018-10-18  Paul Koning  <ni1d@arrl.net>

        * udivmodsi4.c (__udivmodsi4): Rename to conform to coding
        standard.
        * divmod.c: Update references to __udivmodsi4.
        * udivmod.c: Ditto.
        * udivhi3.c: New file.
        * udivmodhi4.c: New file.
        * config/pdp11/t-pdp11 (LIB2ADD): Add the new files.


Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 1/2).
  2018-09-05 13:40 ` [PATCH 21/25] GCN Back-end (part 1/2) Andrew Stubbs
@ 2018-11-09 19:11   ` Jeff Law
  2018-11-12 12:13     ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-11-09 19:11 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/5/18 7:40 AM, Andrew Stubbs wrote:
> This part initially failed to send due to size.
> 
> This is the main portion of the GCN back-end, plus the configuration
> adjustments needed to build it.
> 
> The config.sub patch is here so people can try it, but I'm aware that
> needs to
> be committed elsewhere first.
> 
> The back-end contains various bits that support OpenACC and OpenMP, but the
> middle-end and libgomp patches are missing.  I included them here because
> they're harmless and carving up the files seems like unnecessary effort.
>  The
> remaining offload support will be posted at a later date.
> 
> The gcn-run.c is a separate tool that can run a GCN program on a GPU using
> the ROCm drivers and HSA runtime libraries.
> 
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>> 	    Kwok Cheung Yeung  <kcy@codesourcery.com>
>> 	    Julian Brown  <julian@codesourcery.com>
>> 	    Tom de Vries  <tom@codesourcery.com>
>> 	    Jan Hubicka  <hubicka@ucw.cz>
>> 	    Martin Jambor  <mjambor@suse.cz>
> 
>> 	* config.sub: Recognize amdgcn*-*-amdhsa.
>> 	* configure.ac: Likewise.
>> 	* configure: Regenerate.
> 
>> 	gcc/
>> 	* common/config/gcn/gcn-common.c: New file.
>> 	* config.gcc: Add amdgcn*-*-amdhsa configuration.
>> 	* config/gcn/constraints.md: New file.
>> 	* config/gcn/driver-gcn.c: New file.
>> 	* config/gcn/gcn-builtins.def: New file.
>> 	* config/gcn/gcn-hsa.h: New file.
>> 	* config/gcn/gcn-modes.def: New file.
>> 	* config/gcn/gcn-opts.h: New file.
>> 	* config/gcn/gcn-passes.def: New file.
>> 	* config/gcn/gcn-protos.h: New file.
>> 	* config/gcn/gcn-run.c: New file.
>> 	* config/gcn/gcn-tree.c: New file.
>> 	* config/gcn/gcn-valu.md: New file.
>> 	* config/gcn/gcn.c: New file.
>> 	* config/gcn/gcn.h: New file.
>> 	* config/gcn/gcn.md: New file.
>> 	* config/gcn/gcn.opt: New file.
>> 	* config/gcn/mkoffload.c: New file.
>> 	* config/gcn/offload.h: New file.
>> 	* config/gcn/predicates.md: New file.
>> 	* config/gcn/t-gcn-hsa: New file.
> 
> 0021-gcn-port-pt1.patch
> 

> +amdgcn-*-amdhsa)
> +	tm_file="dbxelf.h elfos.h gcn/gcn-hsa.h gcn/gcn.h newlib-stdint.h"
Please consider killing dbxelf.h :-)  I assume your default debugging
format is dwarf2, but do you really need to support embedded stabs?



> +
> +/* FIXME: review debug info settings */
> +#define PREFERRED_DEBUGGING_TYPE   DWARF2_DEBUG
> +#define DWARF2_DEBUGGING_INFO      1
> +#define DWARF2_ASM_LINE_DEBUG_INFO 1
> +#define EH_FRAME_THROUGH_COLLECT2  1
These look reasonable.  Essentially you're doing dwarf2 by default.
Maybe just look at EH_FRAME_THROUGH_COLLECT2 more closely to make sure
it still makes sense and isn't a remnant of early port hackery to get
things stumbling along.


> diff --git a/gcc/config/gcn/gcn-run.c b/gcc/config/gcn/gcn-run.c
> new file mode 100644
> index 0000000..3dea343
> --- /dev/null
> +++ b/gcc/config/gcn/gcn-run.c
I'm going to assume this is largely correct.  It looks like all the glue
code to run kernels on the unit.  It loads the code to be run AFACIT, so
it doesn't need an exception clause as it's not linked against the code
that is to be run IIUC.



> diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
> new file mode 100644
> index 0000000..0365baf
> --- /dev/null
> +++ b/gcc/config/gcn/gcn-tree.c
> @@ -0,0 +1,715 @@
> +/* Copyright (C) 2017-2018 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +   
> +   GCC is free software; you can redistribute it and/or modify it under
> +   the terms of the GNU General Public License as published by the Free
> +   Software Foundation; either version 3, or (at your option) any later
> +   version.
> +   
> +   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +   for more details.
> +   
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* {{{ Includes.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tree-pass.h"
> +#include "gimple-iterator.h"
> +#include "cfghooks.h"
> +#include "cfgloop.h"
> +#include "tm_p.h"
> +#include "stringpool.h"
> +#include "fold-const.h"
> +#include "varasm.h"
> +#include "omp-low.h"
> +#include "omp-general.h"
> +#include "internal-fn.h"
> +#include "tree-vrp.h"
> +#include "tree-ssanames.h"
> +#include "tree-ssa-operands.h"
> +#include "gimplify.h"
> +#include "tree-phinodes.h"
> +#include "cgraph.h"
> +#include "targhooks.h"
> +#include "langhooks-def.h"
> +
> +/* }}}  */
> +/* {{{ OMP GCN pass.  */
> +
> +unsigned int
> +execute_omp_gcn (void)
So some documentation about what this pass is supposed to be doing would
be helpful in the future if anyone needs to change it.



There's a ton of work related to reduction setup, updates and teardown.
 I don't guess there's any generic code we can/should be re-using.  Sigh.


> diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
> new file mode 100644
> index 0000000..0531c4f
> --- /dev/null
> +++ b/gcc/config/gcn/gcn-valu.md
> +
> +    if (can_create_pseudo_p ())
> +      {
> +        rtx exec = gcn_full_exec_reg ();
> +	rtx undef = gcn_gen_undef (<MODE>mode);
Looks like tabs-vs-spaces problem in here.  It's a nit obviously.  Might
as well fix it now and go a global search and replace in the other gcn
files so they're right from day 1.

WRT your move patterns.  I'm a bit concerned about using distinct
patterns for so many different variants.  But they mostly seem confined
to vector variants.  Be aware you may need to squash them into a single
pattern over time to keep LRA happy.

Nothing looks too bad here...

jeff


^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-09-05 13:43 ` [PATCH 21/25] GCN Back-end (part 2/2) Andrew Stubbs
  2018-09-05 14:22   ` Joseph Myers
@ 2018-11-09 19:40   ` Jeff Law
  2018-11-12 12:53     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-11-09 19:40 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 9/5/18 7:42 AM, Andrew Stubbs wrote:
> This part initially failed to send due to size.
> 
> Here's part 2.
> 
> 0021-gcn-port-pt2.patch
[ ... ]
You've already addressed Joseph's comments in a follow-up.


> 
> diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
> new file mode 100644
> index 0000000..7e59b06
> --- /dev/null
> +++ b/gcc/config/gcn/gcn.c
> @@ -0,0 +1,6161 @@
> +/* Copyright (C) 2016-2018 Free Software Foundation, Inc.
> +
> +   This file is free software; you can redistribute it and/or modify it under
> +   the terms of the GNU General Public License as published by the Free
> +   Software Foundation; either version 3 of the License, or (at your option)
> +   any later version.
> +
> +   This file is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +   for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* {{{ Includes.  */
> +
> +/* We want GET_MODE_SIZE et al to return integers, please.  */

> +
> +static tree
> +gcn_handle_amdgpu_hsa_kernel_attribute (tree *node, tree name,
> +					tree args, int, bool *no_add_attrs)
> +{
> +  if (TREE_CODE (*node) != FUNCTION_TYPE
> +      && TREE_CODE (*node) != METHOD_TYPE
> +      && TREE_CODE (*node) != METHOD_TYPE
> +      && TREE_CODE (*node) != FIELD_DECL
> +      && TREE_CODE (*node) != TYPE_DECL)
METHOD_TYPE tested twice here.  Might as well use FUNC_OR_METHOD_TYPE_P.


>> +
> +/* Return true is REG is a valid place to store a pointer,
> +   for instructions that require an SGPR.
> +   FIXME rename. */
> +
> +static bool
> +gcn_address_register_p (rtx reg, machine_mode mode, bool strict)
> +{
> +  if (GET_CODE (reg) == SUBREG)
> +    reg = SUBREG_REG (reg);
> +
> +  if (!REG_P (reg))
> +    return false;
> +
> +  if (GET_MODE (reg) != mode)
> +    return false;
> +
> +  int regno = REGNO (reg);
> +
> +  if (regno >= FIRST_PSEUDO_REGISTER)
> +    {
> +      if (!strict)
> +	return true;
> +
> +      if (!reg_renumber)
> +	return false;
> +
> +      regno = reg_renumber[regno];
> +    }
> +
> +  return (regno < 102 || regno == M0_REG
> +	  || regno == ARG_POINTER_REGNUM || regno == FRAME_POINTER_REGNUM);
Consider using a symbolic name for "102" :-)



> +
> +/* Generate epilogue.  Called from gen_epilogue during pro_and_epilogue pass.
> +
> +   See gcn_expand_prologue for stack details.  */
> +
> +void
> +gcn_expand_epilogue (void)
You probably need a barrier in here to ensure that the scheduler doesn't
move an aliased memory reference into the local stack beyond the stack
adjustment.

You're less likely to run into it because you eliminate frame pointers
fairly aggressively, but it's still the right thing to do.
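
The usual idiom, sketched here with UNSPEC_BLOCKAGE standing in for
whatever constant the port defines, is an empty volatile insn that
memory accesses can't be scheduled across:

  (define_insn "blockage"
    [(unspec_volatile [(const_int 0)] UNSPEC_BLOCKAGE)]
    ""
    ""
    [(set_attr "length" "0")])

emitted via emit_insn (gen_blockage ()) just before the stack
adjustment in the epilogue.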

> +
> +/* Implement TARGET_LEGITIMATE_COMBINED_INSN.
> +
> +   Return false if the instruction is not appropriate as a combination of two
> +   or more instructions.  */
> +
> +bool
> +gcn_legitimate_combined_insn (rtx_insn *insn)
> +{
> +  rtx pat = PATTERN (insn);
> +
> +  /* The combine pass tends to strip (use (exec)) patterns from insns.  This
> +     means it basically switches everything to use the *_scalar form of the
> +     instructions, which is not helpful.  So, this function disallows such
> +     combinations.  Unfortunately, this also disallows combinations of genuine
> +     scalar-only patterns, but those only come from explicit expand code.
> +
> +     Possible solutions:
> +     - Invent TARGET_LEGITIMIZE_COMBINED_INSN.
> +     - Remove all (use (EXEC)) and rely on md_reorg with "exec" attribute.
> +   */
This seems a bit hokey.  Why specifically is combine removing the USE?



> +
> +/* If INSN sets the EXEC register to a constant value, return the value,
> +   otherwise return zero.  */
> +
> +static
> +int64_t gcn_insn_exec_value (rtx_insn *insn)
Nit.  Make sure the function's name is in column 0 by moving the return
type to the previous line.


> +{
> +  if (!NONDEBUG_INSN_P (insn))
> +    return 0;
> +
> +  rtx pattern = PATTERN (insn);
> +
> +  if (GET_CODE (pattern) == SET)
> +    {
> +      rtx dest = XEXP (pattern, 0);
> +      rtx src = XEXP (pattern, 1);
> +
> +      if (GET_MODE (dest) == DImode
> +	  && REG_P (dest) && REGNO (dest) == EXEC_REG
> +	  && CONST_INT_P (src))
> +	return INTVAL (src);
> +    }
> +
> +  return 0;
> +}
> +
> +/* Sets the EXEC register before INSN to the value that it had after
> +   LAST_EXEC_DEF.  The constant value of the EXEC register is returned if
> +   known, otherwise it returns zero.  */
> +
> +static
> +int64_t gcn_restore_exec (rtx_insn *insn, rtx_insn *last_exec_def,
> +			  int64_t curr_exec, bool curr_exec_known,
> +			  bool &last_exec_def_saved)
Similarly.  Probably worth a check through all the code to catch any
other similar nits.




> +
> +  CLEAR_REG_SET (&live);
> +
> +  /* "Manually Inserted Wait States (NOPs)."
> +   
> +     GCN hardware detects most kinds of register dependencies, but there
> +     are some exceptions documented in the ISA manual.  This pass
> +     detects the missed cases, and inserts the documented number of NOPs
> +     required for correct execution.  */
How unpleasant :(  But if it's what you need to do, so be it.  I'll
assume the compiler is the right place to do this -- though some ports
handle this kind of stuff in the assembler or linker.





> diff --git a/gcc/config/gcn/gcn.h b/gcc/config/gcn/gcn.h
> new file mode 100644
> index 0000000..74f0773
> --- /dev/null
> +++ b/gcc/config/gcn/gcn.h
[ ... ]

> +/* Disable the "current_vector_size" feature intended for
> +   AVX<->SSE switching.  */
Guessing you just copied the comment, you probably want to update it to
not refer to AVX/SSE.

You probably need to define the safe-speculation stuff
(TARGET_SPECULATION_SAFE_VALUE).



> +
> +; "addptr" is the same as "add" except that it must not write to VCC or SCC
> +; as a side-effect.  Unfortunately GCN3 does not have a suitable instruction
> +; for this, so we use a split to save and restore the condition code.
> +; This pattern must use "Sg" instead of "SD" to prevent the compiler
> +; assigning VCC as the destination.
> +; FIXME: Provide GCN5 implementation
I worry about the save/restore aspects of this.  Haven't we discussed
this somewhere?!?




Generally I don't see major concerns.   THere's some minor things to
fix.  As far as the correctness of the code you're generating, well, I'm
going have to assume you've got that right and will own addressing bugs
in that space.

My inclination would be to have this go forward into gcc-9 as the minor
issues are wrapped up.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 20/25] GCN libgcc.
  2018-11-09 18:49   ` Jeff Law
@ 2018-11-12 12:01     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-12 12:01 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 09/11/2018 18:48, Jeff Law wrote:
>> diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
>> index 0c5b264..6f68257 100644
>> --- a/libgcc/Makefile.in
>> +++ b/libgcc/Makefile.in
>> @@ -429,9 +429,11 @@ LIB2ADD += enable-execute-stack.c
>>   # While emutls.c has nothing to do with EH, it is in LIB2ADDEH*
>>   # instead of LIB2ADD because that's the way to be sure on some targets
>>   # (e.g. *-*-darwin*) only one copy of it is linked.
>> +ifneq ($(enable_emutls),no)
>>   LIB2ADDEH += $(srcdir)/emutls.c
>>   LIB2ADDEHSTATIC += $(srcdir)/emutls.c
>>   LIB2ADDEHSHARED += $(srcdir)/emutls.c
>> +endif
> Why is this needed? Are you just trying to cut out stuff you don't need
> in the quest for smaller code or does this cause a more direct problem?

This dates back to when that code wouldn't compile. It also surprised me 
that --disable-emutls didn't do it (but this stuff is long ago now, so I 
don't recall the details of that).

Anyway, the code compiles now, so I can remove this hunk.

>> +/* Provide an entry point symbol to silence a linker warning.  */
>> +void _start() {}
> This seems wrong.   I realize you're trying to quiet a linker warning
> here, but for the case where you're creating GCN executables (testing?)
> this should probably be handled by the C-runtime or linker script.

We're using an LLVM linker, so I'd rather fix things here than there.

Anyway, I plan to make this a proper kernel and use it to run static 
constructors, one day. Possibly it should be in Newlib, but then the 
"ctors" code is found in crtstuff and libgcc2, so I don't know?

>> diff --git a/libgcc/config/gcn/gomp_print.c b/libgcc/config/gcn/gomp_print.c
>> new file mode 100644
>> index 0000000..41f50c3
>> --- /dev/null
>> +++ b/libgcc/config/gcn/gomp_print.c
> [ ... ]
> Would this be better in libgomp?  Oh, you addressed that in the
> prologue.  Feels like libgomp would be better to me, but I can
> understand the rationale behind wanting it in libgcc.

Now that printf works, possibly it should be moved back. There's no 
debugger for this target, so these routines are my usual means for 
debugging stuff, and libgomp isn't built in the config used to run the 
testsuite.

> I won't comment on the static sizes since this apparently has to match
> something already in existence.

Yeah, this is basically a shared memory interface. I plan to implement a 
proper circular buffer, etc., etc., etc., but there's a lot to do.

>> +
>> +void
>> +gomp_print_string (const char *msg, const char *value)
>> +{
>> +  struct printf_data *output = reserve_print_slot ();
>> +  output->type = 2; /* String.  */
>> +
>> +  strncpy (output->msg, msg, 127);
>> +  output->msg[127] = '\0';
>> +  strncpy (output->text, value, 127);
>> +  output->text[127] = '\0';
>> +
>> +  asm ("" ::: "memory");
>> +  output->written = 1;
>> +}
> I'm not familiar with the GCN memory model, but your asm is really just
> a barrier for the compiler.  Do you need any kind of hardware fencing
> here?  Similarly for other instances.

As long as the compiler doesn't reorder the write instructions then this 
is fine, as is. The architecture does not reorder writes in hardware.

That said, actually the updated version I'm preparing has additional L1 
cache flushes to make absolutely sure the data are written to the L2 
cache memory in order. I did this when investigating a problem to make 
sure I wasn't losing debug output, and even though I found that I was 
not, I kept the patch anyway.
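
For what it's worth, the intended ordering could also be written as an
explicit release store (a sketch, assuming the __atomic builtins expand
sensibly on GCN):

  /* Orders the preceding strncpy stores before the flag is visible.  */
  __atomic_store_n (&output->written, 1, __ATOMIC_RELEASE);

On GCN the plain store plus compiler barrier is equivalent, since the
hardware does not reorder writes.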

> All these functions probably need a little comment on their purpose and
> arguments.

Understood.

> Note some of the divmod stuff recently changed.  You may need minor
> updates as a result of those patches.  See:

OK, thanks for the heads up.

And thanks for the review. I'm planning to post a somewhat-updated V2 
patch set any week now.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 1/2).
  2018-11-09 19:11   ` Jeff Law
@ 2018-11-12 12:13     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-12 12:13 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 09/11/2018 19:11, Jeff Law wrote:
> There's a ton of work related to reduction setup, updates and teardown.
>   I don't guess there's any generic code we can/should be re-using.  Sigh.

I'm not sure what can be shared, or not, here. For OpenMP we don't have 
any special code, but OpenACC is much closer to the metal, and AMD GCN 
does things somewhat differently to NVPTX.

> WRT your move patterns.  I'm a bit concerned about using distinct
> matters for so many different variants.  But they mostly seem confined
> to vector variants.  Be aware you may need to squash them into a single
> pattern over time to keep LRA happy.

As you might guess, the move patterns have been really difficult to get 
right. The added dependency on the EXEC register tends to put LRA into 
an infinite loop, and the fact that GCN vector moves are always 
scatter/gather (rather than a contiguous load/store from a base address) 
makes spills rather painful.

Thanks for your review, I'll have a V2 patch-set soonish.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-09 19:40   ` Jeff Law
@ 2018-11-12 12:53     ` Andrew Stubbs
  2018-11-12 17:20       ` Segher Boessenkool
  2018-11-14 22:31       ` Jeff Law
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-12 12:53 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 09/11/2018 19:39, Jeff Law wrote:
>> +
>> +/* Generate epilogue.  Called from gen_epilogue during pro_and_epilogue pass.
>> +
>> +   See gcn_expand_prologue for stack details.  */
>> +
>> +void
>> +gcn_expand_epilogue (void)
> You probably need a barrier in here to ensure that the scheduler doesn't
> move an aliased memory reference into the local stack beyond the stack
> adjustment.
> 
> You're less likely to run into it because you eliminate frame pointers
> fairly aggressively, but it's still the right thing to do.

Sorry, I'm not sure I understand what the problem is? How can this 
happen? Surely the scheduler wouldn't change the logic of the code?

>> +
>> +/* Implement TARGET_LEGITIMATE_COMBINED_INSN.
>> +
>> +   Return false if the instruction is not appropriate as a combination of two
>> +   or more instructions.  */
>> +
>> +bool
>> +gcn_legitimate_combined_insn (rtx_insn *insn)
>> +{
>> +  rtx pat = PATTERN (insn);
>> +
>> +  /* The combine pass tends to strip (use (exec)) patterns from insns.  This
>> +     means it basically switches everything to use the *_scalar form of the
>> +     instructions, which is not helpful.  So, this function disallows such
>> +     combinations.  Unfortunately, this also disallows combinations of genuine
>> +     scalar-only patterns, but those only come from explicit expand code.
>> +
>> +     Possible solutions:
>> +     - Invent TARGET_LEGITIMIZE_COMBINED_INSN.
>> +     - Remove all (use (EXEC)) and rely on md_reorg with "exec" attribute.
>> +   */
> This seems a bit hokey.  Why specifically is combine removing the USE?

I don't understand combine fully enough to explain it now, although at 
the time I wrote this, and in a GCC 7 code base, I had followed the code 
through and observed what it was doing.

Basically, if you have two patterns that do the same operation, but one 
has a "parallel" with an additional "use", then combine will tend to 
prefer the one without the "use". That doesn't stop the code working, 
but it makes a premature (accidental) decision about instruction 
selection that we'd prefer to leave to the register allocator.

I don't recall if it did this to lone instructions, but it would 
certainly do so when combining two (or more) instructions, and IIRC 
there are typically plenty of simple moves around that can be easily 
combined.

>> +  /* "Manually Inserted Wait States (NOPs)."
>> +
>> +     GCN hardware detects most kinds of register dependencies, but there
>> +     are some exceptions documented in the ISA manual.  This pass
>> +     detects the missed cases, and inserts the documented number of NOPs
>> +     required for correct execution.  */
> How unpleasant :(  But if it's what you need to do, so be it.  I'll
> assume the compiler is the right place to do this -- though some ports
> handle this kind of stuff in the assembler or linker.

We're using an LLVM assembler and linker, so we have tried to use them 
as is, rather than making parallel changes that would prevent GCC 
working with the last numbered release of LLVM (see the workaround for 
assembler bugs in the BImode instructions).

Expecting the assembler to fix this up would also throw off the 
compiler's offset calculations, and the near/far branch instructions 
have different register requirements that it's better for the compiler 
to know about.

The MIPS backend also inserts NOPs in a similar way.

In future, I'd like to have the scheduler insert real instructions into 
these slots, but that's very much on the to-do list.
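
For reference, the pass is roughly this shape (a simplified sketch; 
gcn_required_nops stands in for the hazard tables from the ISA manual):

static void
gcn_md_reorg (void)
{
  rtx_insn *prev = NULL;

  for (rtx_insn *insn = get_insns (); insn; insn = NEXT_INSN (insn))
    {
      if (!NONDEBUG_INSN_P (insn))
        continue;

      /* Invented helper: the number of wait states the ISA manual
         requires between PREV and INSN that the hardware does not
         interlock for us.  */
      int nops = gcn_required_nops (prev, insn);
      while (nops-- > 0)
        emit_insn_before (gen_nop (), insn);

      prev = insn;
    }
}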

>> +/* Disable the "current_vector_size" feature intended for
>> +   AVX<->SSE switching.  */
> Guessing you just copied the comment, you probably want to update it to
> not refer to AVX/SSE.

Nope, that means exactly what it says. See the (unresolved) discussion 
around "[PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE".

I'll probably move that into a separate patch to commit after the main 
port. It'll suffer poor vectorization in some examples in the meantime, 
but that patch is not going to be straightforward.

> You probably need to define the safe-speculation stuff
> (TARGET_SPECULATION_SAFE_VALUE).

Oh, OK. :-(

I have no idea whether the architecture has those issues or not.

>> +; "addptr" is the same as "add" except that it must not write to VCC or SCC
>> +; as a side-effect.  Unfortunately GCN3 does not have a suitable instruction
>> +; for this, so we use a split to save and restore the condition code.
>> +; This pattern must use "Sg" instead of "SD" to prevent the compiler
>> +; assigning VCC as the destination.
>> +; FIXME: Provide GCN5 implementation
> I worry about the save/restore aspects of this.  Haven't we discussed
> this somewhere?!?

I think this came up in the SPECIAL_REGNO_P patch discussion. We 
eventually found that the underlying problem was the way the 
save/restore reused pseudoregs.

The "addptr" pattern has been rewritten in my draft V2 patchset. It 
still uses a fixed scratch register, but no longer does save/restore.

> Generally I don't see major concerns.   THere's some minor things to
> fix.  As far as the correctness of the code you're generating, well, I'm
> going have to assume you've got that right and will own addressing bugs
> in that space.

Agreed. The bare-machine testsuite runs pretty well, as does the libgomp 
testsuite in an offloading toolchain.

> My inclination would be to have this go forward into gcc-9 as the minor
> issues are wrapped up.

Thanks for your review, I'll have a V2 patch-set soonish.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 12:53     ` Andrew Stubbs
@ 2018-11-12 17:20       ` Segher Boessenkool
  2018-11-12 17:52         ` Andrew Stubbs
  2018-11-14 22:31       ` Jeff Law
  1 sibling, 1 reply; 187+ messages in thread
From: Segher Boessenkool @ 2018-11-12 17:20 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Jeff Law, gcc-patches

On Mon, Nov 12, 2018 at 12:53:26PM +0000, Andrew Stubbs wrote:
> >>+/* Implement TARGET_LEGITIMATE_COMBINED_INSN.
> >>+
> >>+   Return false if the instruction is not appropriate as a combination 
> >>of two
> >>+   or more instructions.  */
> >>+
> >>+bool
> >>+gcn_legitimate_combined_insn (rtx_insn *insn)
> >>+{
> >>+  rtx pat = PATTERN (insn);
> >>+
> >>+  /* The combine pass tends to strip (use (exec)) patterns from insns.  
> >>This
> >>+     means it basically switches everything to use the *_scalar form of 
> >>the
> >>+     instructions, which is not helpful.  So, this function disallows 
> >>such
> >>+     combinations.  Unfortunately, this also disallows combinations of 
> >>genuine
> >>+     scalar-only patterns, but those only come from explicit expand code.
> >>+
> >>+     Possible solutions:
> >>+     - Invent TARGET_LEGITIMIZE_COMBINED_INSN.
> >>+     - Remove all (use (EXEC)) and rely on md_reorg with "exec" 
> >>attribute.
> >>+   */
> >This seems a bit hokey.  Why specifically is combine removing the USE?
> 
> I don't understand combine fully enough to explain it now, although at 
> the time I wrote this, and in a GCC 7 code base, I had followed the code 
> through and observed what it was doing.
> 
> Basically, if you have two patterns that do the same operation, but one 
> has a "parallel" with an additional "use", then combine will tend to 
> prefer the one without the "use". That doesn't stop the code working, 
> but it makes a premature (accidental) decision about instruction 
> selection that we'd prefer to leave to the register allocator.
> 
> I don't recall if it did this to lone instructions, but it would 
> certainly do so when combining two (or more) instructions, and IIRC 
> there are typically plenty of simple moves around that can be easily 
> combined.

If you don't want useless USEs deleted, use UNSPEC_VOLATILE instead?
Or actually use the register, i.e. as input to an actually needed
instruction.

If combine is changing an X and a USE to just that X if it can, combine
is doing a great job!

(combine cannot "combine" one instruction, fwiw; this sometimes could be
useful: just run simplification on every single instruction, and see if
that makes a simpler valid instruction.  Indeed, a common case where it
can help is when the insn is a parallel and one of the arms of it isn't
needed.)


Segher

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 17:20       ` Segher Boessenkool
@ 2018-11-12 17:52         ` Andrew Stubbs
  2018-11-12 18:33           ` Segher Boessenkool
  2018-11-12 18:55           ` Jeff Law
  0 siblings, 2 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-12 17:52 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Jeff Law, gcc-patches

On 12/11/2018 17:20, Segher Boessenkool wrote:
> If you don't want useless USEs deleted, use UNSPEC_VOLATILE instead?
> Or actually use the register, i.e. as input to an actually needed
> instruction.

They're not useless. If we want to do scalar operations in vector 
registers (and we often do, on this target), then we need to write a "1" 
into the EXEC (vector mask) register.

Unless we want to rewrite all scalar operations in terms of vec_merge 
then there's no way to "actually use the register".

There are additional patterns that do scalar operations in scalar 
registers, and therefore do not depend on EXEC, but there are not a 
complete set of instructions for these, so usually we don't use those 
until reload_completed (via splits). I did think of simply disabling 
them until reload_completed, but there are cases where we do want them, 
so that didn't work.

Of course, it's possible that we took a wrong turn early on and ended up 
with a sub-optimal arrangement, but it is where we are.

> If combine is changing an X and a USE to just that X if it can, combine
> is doing a great job!

Not if the "simpler" instruction is somehow more expensive. And, in our 
case, it isn't the instruction itself that is more expensive, but the 
extra instructions that may (or may not) need to be inserted around it 
later.

I might investigate putting the USE inside an UNSPEC_VOLATILE. That 
would have the advantage of letting combine run again. This feels like a 
future project I'd rather not have block the port submission though.

If there are two instructions that both have an UNSPEC_VOLATILE, will 
combine coalesce them into one in the combined pattern?

Thanks

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 17:52         ` Andrew Stubbs
@ 2018-11-12 18:33           ` Segher Boessenkool
  2018-11-12 18:55           ` Jeff Law
  1 sibling, 0 replies; 187+ messages in thread
From: Segher Boessenkool @ 2018-11-12 18:33 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Jeff Law, gcc-patches

On Mon, Nov 12, 2018 at 05:52:25PM +0000, Andrew Stubbs wrote:
> On 12/11/2018 17:20, Segher Boessenkool wrote:
> >If you don't want useless USEs deleted, use UNSPEC_VOLATILE instead?
> >Or actually use the register, i.e. as input to an actually needed
> >instruction.
> 
> They're not useless.

> >If combine is changing an X and a USE to just that X if it can, combine
> >is doing a great job!

Actually, it is incorrect to delete a USE.

Please open a PR.  Thanks.


Segher

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 17:52         ` Andrew Stubbs
  2018-11-12 18:33           ` Segher Boessenkool
@ 2018-11-12 18:55           ` Jeff Law
  2018-11-13 10:23             ` Andrew Stubbs
  2018-11-16 16:10             ` Segher Boessenkool
  1 sibling, 2 replies; 187+ messages in thread
From: Jeff Law @ 2018-11-12 18:55 UTC (permalink / raw)
  To: Andrew Stubbs, Segher Boessenkool; +Cc: gcc-patches

On 11/12/18 10:52 AM, Andrew Stubbs wrote:
> On 12/11/2018 17:20, Segher Boessenkool wrote:
>> If you don't want useless USEs deleted, use UNSPEC_VOLATILE instead?
>> Or actually use the register, i.e. as input to an actually needed
>> instruction.
> 
> They're not useless. If we want to do scalar operations in vector
> registers (and we often do, on this target), then we need to write a "1"
> into the EXEC (vector mask) register.
Presumably you're setting up active lanes or some such.  This may
ultimately be better modeled by ignoring the problem until much later in
the pipeline.

Shortly before assembly output you run a little LCM-like pass to find
optimal points to insert the assignment to the vector register.  It's a
lot like the mode switching stuff or zeroing the upper halves of the AVX
registers to avoid partial register stalls.  The local properties are
different, but these all feel like the same class of problem.
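
(For reference, the mode-switching machinery needs little more than a 
pair of hooks, plus the OPTIMIZE_MODE_SWITCHING and 
NUM_MODES_FOR_MODE_SWITCHING macros.  A sketch, with the EXEC_* modes 
and the predicate invented:

/* Treat "EXEC == 1" as a mode that scalar-in-vector insns need.  */

static int
gcn_mode_needed (int entity ATTRIBUTE_UNUSED, rtx_insn *insn)
{
  return scalar_in_vector_insn_p (insn) ? EXEC_ONE : EXEC_ANY;
}

static void
gcn_mode_emit (int entity ATTRIBUTE_UNUSED, int mode, int prev_mode,
               HARD_REG_SET live ATTRIBUTE_UNUSED)
{
  if (mode == EXEC_ONE && prev_mode != EXEC_ONE)
    emit_insn (gen_exec_set_one ());  /* Invented: writes 1 to EXEC.  */
}

The LCM machinery then finds minimal insertion points for you.)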

> 
> Unless we want to rewrite all scalar operations in terms of vec_merge
> then there's no way to "actually use the register".
I think you need to correctly model it.  If you lie to the compiler
about what's going on, you're going to run into problems.
> 
> I might investigate putting the USE inside an UNSPEC_VOLATILE. That
> would have the advantage of letting combine run again. This feels like a
> future project I'd rather not have block the port submission though.
The gcn_legitimate_combined_insn code isn't really acceptable though.
You need a cleaner solution here.

> 
> If there are two instructions that both have an UNSPEC_VOLATILE, will
> combine coalesce them into one in the combined pattern?
I think you can put a different constant on each.

jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 18:55           ` Jeff Law
@ 2018-11-13 10:23             ` Andrew Stubbs
  2018-11-13 10:33               ` Segher Boessenkool
  2018-11-16 16:10             ` Segher Boessenkool
  1 sibling, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-13 10:23 UTC (permalink / raw)
  To: Jeff Law, Segher Boessenkool; +Cc: gcc-patches

On 12/11/2018 18:54, Jeff Law wrote:
> On 11/12/18 10:52 AM, Andrew Stubbs wrote:
>> On 12/11/2018 17:20, Segher Boessenkool wrote:
>>> If you don't want useless USEs deleted, use UNSPEC_VOLATILE instead?
>>> Or actually use the register, i.e. as input to an actually needed
>>> instruction.
>>
>> They're not useless. If we want to do scalar operations in vector
>> registers (and we often do, on this target), then we need to write a "1"
>> into the EXEC (vector mask) register.
> Presumably you're setting up active lanes or some such.  This may
> ultimately be better modeled by ignoring the problem until much later in
> the pipeline.
> 
> Shortly before assembly output you run a little LCM-like pass to find
> optimal points to insert the assignment to the vector register.  It's a
> lot like the mode switching stuff or zeroing the upper halves of the AVX
> registers to avoid partial register stalls.  The local properties are
> different, but these all feel like the same class of problem.

Yes, this is one of the plans. The tricky bit is getting that to work 
right with while_ult.

Of course, the real reason this is still an issue is finding time for it 
when there's lots else to be done.

>> Unless we want to rewrite all scalar operations in terms of vec_merge
>> then there's no way to "actually use the register".
> I think you need to correctly model it.  If you lie to the compiler
> about what's going on, you're going to run into problems.
>>
>> I might investigate putting the USE inside an UNSPEC_VOLATILE. That
>> would have the advantage of letting combine run again. This feels like a
>> future project I'd rather not have block the port submission though.
> The gcn_legitimate_combined_insn code isn't really acceptable though.
> You need a cleaner solution here.

Now that Segher says the combine issue is a bug, I think I'll remove 
gcn_legitimate_combined_insn altogether -- it's only an optimization 
issue, after all -- and try to find a testcase I can report as a PR.

I don't suppose I can easily reproduce it on another architecture, so 
it'll have to wait until GCN is committed.

It's also possible that the issue has ceased to exist since GCC 7.

>> If there are two instructions that both have an UNSPEC_VOLATILE, will
>> combine coalesce them into one in the combined pattern?
> I think you can put a different constant on each.

I was thinking of it the other way around. Two instructions put together 
would still only want to "use" EXEC once. Of course, it would only work 
when the required EXEC value is the same, or in the same pseudoreg.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-13 10:23             ` Andrew Stubbs
@ 2018-11-13 10:33               ` Segher Boessenkool
  0 siblings, 0 replies; 187+ messages in thread
From: Segher Boessenkool @ 2018-11-13 10:33 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Jeff Law, gcc-patches

On Tue, Nov 13, 2018 at 10:23:12AM +0000, Andrew Stubbs wrote:
> Now that Segher says the combine issue is a bug,

Well, first show what really happens; if it really deletes a USE, that
is a bug yes.  rtl.def says:

/* Indicate something is used in a way that we don't want to explain.
   For example, subroutine calls will use the register
   in which the static chain is passed.

   USE can not appear as an operand of other rtx except for PARALLEL.
   USE is not deletable, as it indicates that the operand
   is used in some unknown way.  */
DEF_RTL_EXPR(USE, "use", "e", RTX_EXTRA)


> I don't suppose I can easily reproduce it on another architecture, so 
> it'll have to wait until GCN is committed.

Just show the -fdump-rtl-combine-all output?

> It's also possible that the issue has ceased to exist since GCC 7.

Yes, please mention the version in the PR, if it's not trunk :-)


Segher

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 12:53     ` Andrew Stubbs
  2018-11-12 17:20       ` Segher Boessenkool
@ 2018-11-14 22:31       ` Jeff Law
  2018-11-15  9:55         ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Jeff Law @ 2018-11-14 22:31 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 11/12/18 5:53 AM, Andrew Stubbs wrote:
> On 09/11/2018 19:39, Jeff Law wrote:
>>> +
>>> +/* Generate epilogue.  Called from gen_epilogue during
>>> pro_and_epilogue pass.
>>> +
>>> +   See gcn_expand_prologue for stack details.  */
>>> +
>>> +void
>>> +gcn_expand_epilogue (void)
>> You probably need a barrier in here to ensure that the scheduler doesn't
>> move an aliased memory reference into the local stack beyond the stack
>> adjustment.
>>
>> You're less likely to run into it because you eliminate frame pointers
>> fairly aggressively, but it's still the right thing to do.
> 
> Sorry, I'm not sure I understand what the problem is? How can this
> happen? Surely the scheduler wouldn't change the logic of the code?
There's a particular case that has historically been problematical.

If you have this kind of sequence in the epilogue

	restore register using FP
	move fp->sp  (deallocates frame)
	return

Under certain circumstances the scheduler can swap the register restore
and move from fp into sp creating something like this:

	move fp->sp (deallocates frame)
	restore register using FP (reads from deallocated frame)
	return

That would normally be OK, except if you take an interrupt between the
first two instructions.  If interrupt handling is done without switching
stacks, then the interrupt handler may write into the just de-allocated
frame destroying the values that were saved in the prologue.

You may not need to worry about that today on the GCN port, but you
really want to fix it now so that it's never a problem.  You *really*
don't want to have to debug this kind of problem in the wild.  Been
there, done that, more than once :(


>> This seems a bit hokey.  Why specifically is combine removing the USE?
> 
> I don't understand combine fully enough to explain it now, although at
> the time I wrote this, and in a GCC 7 code base, I had followed the code
> through and observed what it was doing.
> 
> Basically, if you have two patterns that do the same operation, but one
> has a "parallel" with an additional "use", then combine will tend to
> prefer the one without the "use". That doesn't stop the code working,
> but it makes a premature (accidental) decision about instruction
> selection that we'd prefer to leave to the register allocator.
> 
> I don't recall if it did this to lone instructions, but it would
> certainly do so when combining two (or more) instructions, and IIRC
> there are typically plenty of simple moves around that can be easily
> combined.
I would hazard a guess that combine saw the one without the use as
"simpler" and preferred it.  I think you've made a bit of a fundamental
problem with the way the EXEC register is being handled.  Hopefully you
can get by with some magic UNSPEC wrappers without having to do too much
surgery.

> 
>>> +  /* "Manually Inserted Wait States (NOPs)."
>>> +
>>> +     GCN hardware detects most kinds of register dependencies, but
>>> there
>>> +     are some exceptions documented in the ISA manual.  This pass
>>> +     detects the missed cases, and inserts the documented number of
>>> NOPs
>>> +     required for correct execution.  */
>> How unpleasant :(  But if it's what you need to do, so be it.  I'll
>> assume the compiler is the right place to do this -- though some ports
>> handle this kind of stuff in the assembler or linker.
> 
> We're using an LLVM assembler and linker, so we have tried to use them
> as is, rather than making parallel changes that would prevent GCC
> working with the last numbered release of LLVM (see the workaround for
> assembler bugs in the BImode instructions).
> 
> Expecting the assembler to fix this up would also throw off the
> compiler's offset calculations, and the near/far branch instructions
> have different register requirements that it's better for the compiler
> to know about.
> 
> The MIPS backend also inserts NOPs in a similar way.
MIPS isn't that simple.  If you're not in a .reorder block and you don't
have interlocks, then it leaves it to the assembler...

If you have near/far branch calculations, then those have to be aware of
the nop insertions and you're generally better off doing them both in
the same tool.  You've chosen the compiler.  It's a valid choice, but
does have some downsides.  The assembler is a valid choice too, with a
different set of downsides.

> 
> In future, I'd like to have the scheduler insert real instructions into
> these slots, but that's very much on the to-do list.
If you can model this as a latency between the two points where you
need to insert the nops, then the scheduler will fill in what it can.
But it doesn't generally handle non-interlocked processors.   So you'll
still want your little pass to fix things up when the scheduler couldn't
find useful work to schedule into those bubbles.

> 
> Oh, OK. :-(
> 
> I have no idea whether the architecture has those issues or not.
The guideline I would give to determine if you're vulnerable...  Do you
have speculation, including the ability to speculate past a memory
operation, branch prediction, memory caches and high resolution timer
(ie, like a cycle timer).  If you've got those, then the processor is
likely vulnerable to a spectre V1 style attack.  Those are the basic
building blocks.

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-14 22:31       ` Jeff Law
@ 2018-11-15  9:55         ` Andrew Stubbs
  2018-11-16 13:33           ` Andrew Stubbs
  0 siblings, 1 reply; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-15  9:55 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 14/11/2018 22:30, Jeff Law wrote:
> There's a particular case that has historically been problematical.
> 
> If you have this kind of sequence in the epilogue
> 
> 	restore register using FP
> 	move fp->sp  (deallocates frame)
> 	return
> 
> Under certain circumstances the scheduler can swap the register restore
> and move from fp into sp creating something like this:
> 
> 	move fp->sp (deallocates frame)
> 	restore register using FP (reads from deallocated frame)
> 	return
> 
> That would normally be OK, except if you take an interrupt between the
> first two instructions.  If interrupt handling is done without switching
> stacks, then the interrupt handler may write into the just de-allocated
> frame destroying the values that were saved in the prologue.

OK, so the barrier needs to be right before the stack pointer moves. I 
can do that. :-)

Presumably the same is true for prologues, except that the barrier needs 
to be after the stack adjustment.
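
So the tail of gcn_expand_epilogue would end up something like this 
(a sketch, assuming we add the usual "blockage" unspec_volatile pattern 
that other ports use):

  /* Call-saved registers are restored from the frame above this
     point, using the FP.  */

  /* Scheduling barrier: keep all frame accesses before the frame is
     deallocated.  */
  emit_insn (gen_blockage ());

  /* Only now deallocate the frame.  */
  emit_move_insn (stack_pointer_rtx, hard_frame_pointer_rtx);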

> You may not need to worry about that today on the GCN port, but you
> really want to fix it now so that it's never a problem.  You *really*
> don't want to have to debug this kind of problem in the wild.  Been
> there, done that, more than once :(

I'm not exactly sure how interrupts work on this platform -- we've had 
no use for them yet -- but without a debugger, and with up to 1024 
threads running simultaneously, you can be sure I don't want to debug it!

> I would hazard a guess that combine saw the one without the use as
> "simpler" and preferred it.  I think you've made a bit of a fundamental
> problem with the way the EXEC register is being handled.  Hopefully you
> can get by with some magic UNSPEC wrappers without having to do too much
> surgery.

Exactly so. An initial experiment with combine re-enabled has not shown 
any errors, so it's possible the problem has gone away, but I've not 
been over the full testsuite yet (and you wouldn't expect actual 
failures anyway).

>> In future, I'd like to have the scheduler insert real instructions into
>> these slots, but that's very much on the to-do list.
> If you can model this as a latency between the two points where you
> need to insert the nops, then the scheduler will fill in what it can.
> But it doesn't generally handle non-interlocked processors.   So you'll
> still want your little pass to fix things up when the scheduler couldn't
> find useful work to schedule into those bubbles.

Absolutely, the scheduler is about optimization and this md_reorg pass 
is about correctness.

>> I have no idea whether the architecture has those issues or not.
> The guideline I would give to determine if you're vulnerable...  Do you
> have speculation, including the ability to speculate past a memory
> operation, branch prediction, memory caches and high resolution timer
> (ie, like a cycle timer).  If you've got those, then the processor is
> likely vulnerable to a spectre V1 style attack.  Those are the basic
> building blocks.

We have cycle timers and caches, but I'll have to ask AMD about the 
other details.

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 01/25] Handle vectors that don't fit in an integer.
  2018-09-14 16:03   ` Richard Sandiford
@ 2018-11-15 17:20     ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-15 17:20 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 4861 bytes --]

On 14/09/2018 16:37, Richard Sandiford wrote:
>> diff --git a/gcc/combine.c b/gcc/combine.c
>> index a2649b6..cbf9dae 100644
>> --- a/gcc/combine.c
>> +++ b/gcc/combine.c
>> @@ -8621,7 +8621,13 @@ gen_lowpart_or_truncate (machine_mode mode, rtx x)
>>       {
>>         /* Bit-cast X into an integer mode.  */
>>         if (!SCALAR_INT_MODE_P (GET_MODE (x)))
>> -	x = gen_lowpart (int_mode_for_mode (GET_MODE (x)).require (), x);
>> +	{
>> +	  enum machine_mode imode =
>> +	    int_mode_for_mode (GET_MODE (x)).require ();
>> +	  if (imode == BLKmode)
>> +	    return gen_rtx_CLOBBER (mode, const0_rtx);
>> +	  x = gen_lowpart (imode, x);
> 
> require () will ICE if there isn't an integer mode and always returns
> a scalar_int_mode, so this looks like a no-op.  I think you want
> something like:

This is a patch that I inherited from Honza and Martin and didn't know 
what testcase it fixed.

I think that it being broken shows that it's no longer necessary, and 
reverting the patch and retesting confirms this suspicion.

I've removed it from the patch.

>> @@ -11698,6 +11704,11 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
>>     if (omode == imode)
>>       return x;
>>   
>> +  /* This can happen when there is no integer mode corresponding
>> +     to a size of vector mode.  */
>> +  if (omode == BLKmode)
>> +    goto fail;
>> +
>>     /* We can only support MODE being wider than a word if X is a
>>        constant integer or has a mode the same size.  */
>>     if (maybe_gt (GET_MODE_SIZE (omode), UNITS_PER_WORD)
> 
> This seems like it's working around a bug in the caller.

Again, I inherited this hunk. Removing it and retesting shows no 
regressions, so I'm dropping it.

>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index cd5cf12..776254a 100644
>> --- a/gcc/expr.c
>> +++ b/gcc/expr.c
>> @@ -10569,6 +10569,14 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>   			  || maybe_gt (bitpos + bitsize,
>>   				       GET_MODE_BITSIZE (mode2)));
>>   
>> +	/* If the result is in BLKmode and the underlying object is a
>> +	   vector in a register, and the size of the vector is larger than
>> +	   the largest integer mode, then we must force OP0 to be in memory
>> +	   as this is assumed in later code.  */
>> +	if (REG_P (op0) && VECTOR_MODE_P (mode2) && mode == BLKmode
>> +	    && maybe_gt (bitsize, MAX_FIXED_MODE_SIZE))
>> +	  must_force_mem = 1;
>> +
>>   	/* Handle CONCAT first.  */
>>   	if (GET_CODE (op0) == CONCAT && !must_force_mem)
>>   	  {
> 
> Are you sure this is still needed after:
> 
> 2018-06-04  Richard Sandiford  <richard.sandiford@linaro.org>
> 
> 	* expr.c (expand_expr_real_1): Force the operand into memory if
> 	its TYPE_MODE is BLKmode and if there is no integer mode for
> 	the number of bits being extracted.

Apparently you're right about this. Hunk dropped.

>> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> index 8d94fca..607a2bd 100644
>> --- a/gcc/tree-vect-stmts.c
>> +++ b/gcc/tree-vect-stmts.c
>> @@ -6702,12 +6702,12 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>>   		     supported.  */
>>   		  unsigned lsize
>>   		    = group_size * GET_MODE_BITSIZE (elmode);
>> -		  elmode = int_mode_for_size (lsize, 0).require ();
>>   		  unsigned int lnunits = const_nunits / group_size;
>>   		  /* If we can't construct such a vector fall back to
>>   		     element extracts from the original vector type and
>>   		     element size stores.  */
>> -		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
>> +		  if (int_mode_for_size (lsize, 0).exists (&elmode)
>> +		      && mode_for_vector (elmode, lnunits).exists (&vmode)
>>   		      && VECTOR_MODE_P (vmode)
>>   		      && targetm.vector_mode_supported_p (vmode)
>>   		      && (convert_optab_handler (vec_extract_optab,
>> @@ -7839,11 +7839,11 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>>   		     to a larger load.  */
>>   		  unsigned lsize
>>   		    = group_size * TYPE_PRECISION (TREE_TYPE (vectype));
>> -		  elmode = int_mode_for_size (lsize, 0).require ();
>>   		  unsigned int lnunits = const_nunits / group_size;
>>   		  /* If we can't construct such a vector fall back to
>>   		     element loads of the original vector type.  */
>> -		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
>> +		  if (int_mode_for_size (lsize, 0).exists (&elmode)
>> +		      && mode_for_vector (elmode, lnunits).exists (&vmode)
>>   		      && VECTOR_MODE_P (vmode)
>>   		      && targetm.vector_mode_supported_p (vmode)
>>   		      && (convert_optab_handler (vec_init_optab, vmode, elmode)
> 
> These two are OK independently of the rest (if that's convenient).

Thanks, I've committed the attached. These are the only parts of the 
patch that remain. I've confirmed that there are failures without these 
hunks.

Andrew

[-- Attachment #2: 181115-vector-int-sizes.patch --]
[-- Type: text/x-patch, Size: 2324 bytes --]

Handle vectors that don't fit in an integer.

GCN vector sizes range between 64 and 512 bytes, none of which have
correspondingly sized integer modes.  This breaks a number of assumptions
throughout the compiler, but I don't really want to create modes just for this
purpose.

Instead, this patch fixes up the cases that I've found, so far, such that the
compiler tries something else, or fails to optimize, rather than just ICE.

2018-11-15  Andrew Stubbs  <ams@codesourcery.com>
            Kwok Cheung Yeung  <kcy@codesourcery.com>

	gcc/
	* tree-vect-stmts.c (vectorizable_store): Don't ICE when
	int_mode_for_size fails.
	(vectorizable_load): Likewise.

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 75d77d2..3509d29 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -6672,12 +6672,12 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		     supported.  */
 		  unsigned lsize
 		    = group_size * GET_MODE_BITSIZE (elmode);
-		  elmode = int_mode_for_size (lsize, 0).require ();
 		  unsigned int lnunits = const_nunits / group_size;
 		  /* If we can't construct such a vector fall back to
 		     element extracts from the original vector type and
 		     element size stores.  */
-		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
+		  if (int_mode_for_size (lsize, 0).exists (&elmode)
+		      && mode_for_vector (elmode, lnunits).exists (&vmode)
 		      && VECTOR_MODE_P (vmode)
 		      && targetm.vector_mode_supported_p (vmode)
 		      && (convert_optab_handler (vec_extract_optab,
@@ -7806,11 +7806,11 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		     to a larger load.  */
 		  unsigned lsize
 		    = group_size * TYPE_PRECISION (TREE_TYPE (vectype));
-		  elmode = int_mode_for_size (lsize, 0).require ();
 		  unsigned int lnunits = const_nunits / group_size;
 		  /* If we can't construct such a vector fall back to
 		     element loads of the original vector type.  */
-		  if (mode_for_vector (elmode, lnunits).exists (&vmode)
+		  if (int_mode_for_size (lsize, 0).exists (&elmode)
+		      && mode_for_vector (elmode, lnunits).exists (&vmode)
 		      && VECTOR_MODE_P (vmode)
 		      && targetm.vector_mode_supported_p (vmode)
 		      && (convert_optab_handler (vec_init_optab, vmode, elmode)

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-15  9:55         ` Andrew Stubbs
@ 2018-11-16 13:33           ` Andrew Stubbs
  0 siblings, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2018-11-16 13:33 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

>> The guideline I would give to determine if you're vulnerable...  Do you
>> have speculation, including the ability to speculate past a memory
>> operation, branch prediction, memory caches and high resolution timer
>> (ie, like a cycle timer).  If you've got those, then the processor is
>> likely vulnerable to a spectre V1 style attack.  Those are the basic
>> building blocks.
> 
> We have cycle timers and caches, but I'll have to ask AMD about the 
> other details.

There's no speculation or branch prediction, apparently.

I'll set it up to use speculation_safe_value_not_needed.
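
That should just be a matter of pointing the hook at the stock helper 
from targhooks:

#undef TARGET_SPECULATION_SAFE_VALUE
#define TARGET_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed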

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-12 18:55           ` Jeff Law
  2018-11-13 10:23             ` Andrew Stubbs
@ 2018-11-16 16:10             ` Segher Boessenkool
  2018-11-17 14:07               ` Segher Boessenkool
  1 sibling, 1 reply; 187+ messages in thread
From: Segher Boessenkool @ 2018-11-16 16:10 UTC (permalink / raw)
  To: Jeff Law; +Cc: Andrew Stubbs, gcc-patches

On Mon, Nov 12, 2018 at 11:54:58AM -0700, Jeff Law wrote:
> On 11/12/18 10:52 AM, Andrew Stubbs wrote:
> > If there are two instructions that both have an UNSPEC_VOLATILE, will
> > combine coalesce them into one in the combined pattern?
> I think you can put a different constant on each.

combine (like everything else) will not remove an unspec_volatile.  Two
identical ones can be merged, in theory anyway, but the resulting program
will always still execute the same number of them, in the same order.
unspec_volatile's have unknown side effects, and those have to happen.

If you really need to prevent merging them, then using different
unspec numbers will work, sure.


Segher

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/25] GCN Back-end (part 2/2).
  2018-11-16 16:10             ` Segher Boessenkool
@ 2018-11-17 14:07               ` Segher Boessenkool
  0 siblings, 0 replies; 187+ messages in thread
From: Segher Boessenkool @ 2018-11-17 14:07 UTC (permalink / raw)
  To: Jeff Law; +Cc: Andrew Stubbs, gcc-patches

On Fri, Nov 16, 2018 at 10:09:59AM -0600, Segher Boessenkool wrote:
> On Mon, Nov 12, 2018 at 11:54:58AM -0700, Jeff Law wrote:
> > On 11/12/18 10:52 AM, Andrew Stubbs wrote:
> > > If there are two instructions that both have an UNSPEC_VOLATILE, will
> > > combine coalesce them into one in the combined pattern?
> > I think you can put a different constant on each.
> 
> combine (like everything else) will not remove an unspec_volatile.  Two
> identical ones can be merged, in theory anyway, but the resulting program
> will always still execute the same number of them, in the same order.
> unspec_volatile's have unknown side effects, and those have to happen.
> 
> If you really need to prevent merging them, then sure, using different
> unspec numbers will work, sure.

Here is a simple example, PowerPC code:

long f(long x)
{
	if (x)
		return __builtin_ppc_mftb();
	else
		return __builtin_ppc_mftb();
}

This compiles to just

        mfspr 3,268
        blr

[ That builtin is an unspec_volatile:
(define_insn "rs6000_mftb_<mode>"
  [(set (match_operand:GPR 0 "gpc_reg_operand" "=r")
        (unspec_volatile:GPR [(const_int 0)] UNSPECV_MFTB))]
...)
]

It is optimised by the PRE pass.  If you say -fno-tree-pre -fno-code-hoisting
it is still two mftb's in RTL, but the jump2 pass (crossjumping) optimises
that away, too.  So, both gimple and RTL passes know how to do this, and
both have the same semantics for unspec_volatile.

Changing the testcase to

long f(long x)
{
        if (x) {
                __builtin_ppc_mftb();
                return __builtin_ppc_mftb();
        } else {
                __builtin_ppc_mftb();
                return __builtin_ppc_mftb();
        }
}

results in

        mfspr 9,268
        mfspr 3,268
        blr

showing that two identical unspec_volatile's are not elided if both are
executed.


Segher

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 11/25] Simplify vec_merge according to the mask.
  2018-09-28 13:50                     ` Andrew Stubbs
@ 2019-02-22  3:40                       ` H.J. Lu
  0 siblings, 0 replies; 187+ messages in thread
From: H.J. Lu @ 2019-02-22  3:40 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: GCC Patches, Richard Sandiford

On Fri, Sep 28, 2018 at 6:33 AM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 28/09/18 09:11, Richard Sandiford wrote:
> > Yes, thanks.
>
> Committed.
>
> Thanks for all the reviews. :-)
>

This caused:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445


-- 
H.J.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
  2018-09-20 13:09   ` Richard Biener
  2018-09-22 19:22   ` Andreas Schwab
@ 2019-09-03 14:01   ` Kyrill Tkachov
  2019-09-03 15:00     ` Jeff Law
  2019-09-03 15:43     ` Andrew Stubbs
  2 siblings, 2 replies; 187+ messages in thread
From: Kyrill Tkachov @ 2019-09-03 14:01 UTC (permalink / raw)
  To: ams, gcc-patches, richard.henderson

Hi all,

On 9/5/18 12:48 PM, ams@codesourcery.com wrote:
>
> At present, pointers passed to builtin functions, including atomic 
> operators,
> are stripped of their address space properties.  This doesn't seem to be
> deliberate, it just omits to copy them.
>
> Not only that, but it forces pointer sizes to Pmode, which isn't 
> appropriate
> for all address spaces.
>
> This patch attempts to correct both issues.  It works for GCN atomics and
> GCN OpenACC gang-private variables.
>
> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>             Julian Brown  <julian@codesourcery.com>
>
>         gcc/
>         * builtins.c (get_builtin_sync_mem): Handle address spaces.


Sorry for responding to this so late. I'm testing a rebased version of 
Richard's OOL atomic patches [1] and am hitting an ICE building the 
-mabi=ilp32 libgfortran multilib for aarch64-none-elf:

0x7284db emit_library_call_value_1(int, rtx_def*, rtx_def*, 
libcall_type, machine_mode, int, std::pair<rtx_def*, machine_mode>*)
         $SRC/gcc/calls.c:4915
0x1037817 emit_library_call_value(rtx_def*, rtx_def*, libcall_type, 
machine_mode, rtx_def*, machine_mode, rtx_def*, machine_mode, rtx_def*, 
machine_mode)
         $SRC/gcc/rtl.h:4240
0x1037817 aarch64_expand_compare_and_swap(rtx_def**)
         $SRC/gcc/config/aarch64/aarch64.c:16981
0x1353a43 gen_atomic_compare_and_swapsi(rtx_def*, rtx_def*, rtx_def*, 
rtx_def*, rtx_def*, rtx_def*, rtx_def*, rtx_def*)
         $SRC/gcc/config/aarch64/atomics.md:34
0xb1f9f1 insn_gen_fn::operator()(rtx_def*, rtx_def*, rtx_def*, rtx_def*, 
rtx_def*, rtx_def*, rtx_def*, rtx_def*) const
         $SRC/gcc/recog.h:324
0xb1f9f1 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
         $SRC/gcc/optabs.c:7443
0xb1fa78 maybe_expand_insn(insn_code, unsigned int, expand_operand*)
         $SRC/gcc/optabs.c:7459
0xb21024 expand_atomic_compare_and_swap(rtx_def**, rtx_def**, rtx_def*, 
rtx_def*, rtx_def*, bool, memmodel, memmodel)
         $SRC/gcc/optabs.c:6448
0x709bd3 expand_builtin_atomic_compare_exchange
         $SRC/gcc/builtins.c:6379
0x71a4e9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
         $SRC/gcc/builtins.c:8147
0x88b746 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, 
expand_modifier, rtx_def**, bool)
         $SRC/gcc/expr.c:11052
0x88cce6 expand_expr_real(tree_node*, rtx_def*, machine_mode, 
expand_modifier, rtx_def**, bool)
         $SRC/gcc/expr.c:8289
0x74cb47 expand_expr
         $SRC/gcc/expr.h:281
0x74cb47 expand_call_stmt
         $SRC/gcc/cfgexpand.c:2731
0x74cb47 expand_gimple_stmt_1
         $SRC/gcc/cfgexpand.c:3710
0x74cb47 expand_gimple_stmt
         $SRC/gcc/cfgexpand.c:3875
0x75439b expand_gimple_basic_block
         $SRC/gcc/cfgexpand.c:5915
0x7563ab execute
         $SRC/gcc/cfgexpand.c:6538
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

A MEM rtx now uses a DImode address where for ILP32 we expect SImode.

This looks to be because....

[1] https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00062.html


> ---
>  gcc/builtins.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>

0002-Propagate-address-spaces-to-builtins.patch

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 58ea747..361361c 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -5781,14 +5781,21 @@ static rtx
  get_builtin_sync_mem (tree loc, machine_mode mode)
  {
    rtx addr, mem;
+  int addr_space = TYPE_ADDR_SPACE (POINTER_TYPE_P (TREE_TYPE (loc))
+				    ? TREE_TYPE (TREE_TYPE (loc))
+				    : TREE_TYPE (loc));
+  scalar_int_mode addr_mode = targetm.addr_space.address_mode (addr_space);
  
... This now returns Pmode (the default for the hook) for aarch64 ILP32, which is always DImode.

-  addr = expand_expr (loc, NULL_RTX, ptr_mode, EXPAND_SUM);

Before this patch we used ptr_mode, which does the right thing for AArch64 ILP32.
Do you think we should just be implementing targetm.addr_space.address_mode for AArch64 to return SImode for ILP32?
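
Something like this, perhaps (untested sketch):

static scalar_int_mode
aarch64_addr_space_address_mode (addr_space_t addrspace ATTRIBUTE_UNUSED)
{
  /* ILP32 uses 32-bit addresses even though Pmode is DImode.  */
  return TARGET_ILP32 ? SImode : DImode;
}

#undef TARGET_ADDR_SPACE_ADDRESS_MODE
#define TARGET_ADDR_SPACE_ADDRESS_MODE aarch64_addr_space_address_mode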

Thanks,
Kyrill


-  addr = convert_memory_address (Pmode, addr);
+  addr = expand_expr (loc, NULL_RTX, addr_mode, EXPAND_SUM);
  
    /* Note that we explicitly do not want any alias information for this
       memory, so that we kill all other live memories.  Otherwise we don't
       satisfy the full barrier semantics of the intrinsic.  */
-  mem = validize_mem (gen_rtx_MEM (mode, addr));
+  mem = gen_rtx_MEM (mode, addr);
+
+  set_mem_addr_space (mem, addr_space);
+
+  mem = validize_mem (mem);
  
    /* The alignment needs to be at least according to that of the mode.  */
    set_mem_align (mem, MAX (GET_MODE_ALIGNMENT (mode),

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2019-09-03 14:01   ` [PATCH 02/25] Propagate address spaces to builtins Kyrill Tkachov
@ 2019-09-03 15:00     ` Jeff Law
  2019-09-04 14:21       ` Kyrill Tkachov
  2019-09-03 15:43     ` Andrew Stubbs
  1 sibling, 1 reply; 187+ messages in thread
From: Jeff Law @ 2019-09-03 15:00 UTC (permalink / raw)
  To: Kyrill Tkachov, ams, gcc-patches, richard.henderson

On 9/3/19 8:01 AM, Kyrill Tkachov wrote:
> Hi all,
> 
> On 9/5/18 12:48 PM, ams@codesourcery.com wrote:
>>
>> At present, pointers passed to builtin functions, including atomic
>> operators,
>> are stripped of their address space properties.  This doesn't seem to be
>> deliberate, it just omits to copy them.
>>
>> Not only that, but it forces pointer sizes to Pmode, which isn't
>> appropriate
>> for all address spaces.
>>
>> This patch attempts to correct both issues.  It works for GCN atomics and
>> GCN OpenACC gang-private variables.
>>
>> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>>             Julian Brown  <julian@codesourcery.com>
>>
>>         gcc/
>>         * builtins.c (get_builtin_sync_mem): Handle address spaces.
> 
> 
> Sorry for responding to this so late. I'm testing a rebased version of
> Richard's OOL atomic patches [1] and am hitting an ICE building the
> -mabi=ilp32 libgfortran multilib for aarch64-none-elf:
> 
> 0x7284db emit_library_call_value_1(int, rtx_def*, rtx_def*,
> libcall_type, machine_mode, int, std::pair<rtx_def*, machine_mode>*)
>         $SRC/gcc/calls.c:4915
> 0x1037817 emit_library_call_value(rtx_def*, rtx_def*, libcall_type,
> machine_mode, rtx_def*, machine_mode, rtx_def*, machine_mode, rtx_def*,
> machine_mode)
>         $SRC/gcc/rtl.h:4240
> 0x1037817 aarch64_expand_compare_and_swap(rtx_def**)
>         $SRC/gcc/config/aarch64/aarch64.c:16981
> 0x1353a43 gen_atomic_compare_and_swapsi(rtx_def*, rtx_def*, rtx_def*,
> rtx_def*, rtx_def*, rtx_def*, rtx_def*, rtx_def*)
>         $SRC/gcc/config/aarch64/atomics.md:34
> 0xb1f9f1 insn_gen_fn::operator()(rtx_def*, rtx_def*, rtx_def*, rtx_def*,
> rtx_def*, rtx_def*, rtx_def*, rtx_def*) const
>         $SRC/gcc/recog.h:324
> 0xb1f9f1 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
>         $SRC/gcc/optabs.c:7443
> 0xb1fa78 maybe_expand_insn(insn_code, unsigned int, expand_operand*)
>         $SRC/gcc/optabs.c:7459
> 0xb21024 expand_atomic_compare_and_swap(rtx_def**, rtx_def**, rtx_def*,
> rtx_def*, rtx_def*, bool, memmodel, memmodel)
>         $SRC/gcc/optabs.c:6448
> 0x709bd3 expand_builtin_atomic_compare_exchange
>         $SRC/gcc/builtins.c:6379
> 0x71a4e9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
>         $SRC/gcc/builtins.c:8147
> 0x88b746 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         $SRC/gcc/expr.c:11052
> 0x88cce6 expand_expr_real(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         $SRC/gcc/expr.c:8289
> 0x74cb47 expand_expr
>         $SRC/gcc/expr.h:281
> 0x74cb47 expand_call_stmt
>         $SRC/gcc/cfgexpand.c:2731
> 0x74cb47 expand_gimple_stmt_1
>         $SRC/gcc/cfgexpand.c:3710
> 0x74cb47 expand_gimple_stmt
>         $SRC/gcc/cfgexpand.c:3875
> 0x75439b expand_gimple_basic_block
>         $SRC/gcc/cfgexpand.c:5915
> 0x7563ab execute
>         $SRC/gcc/cfgexpand.c:6538
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions.
> 
> A MEM rtx now uses a DImode address where for ILP32 we expect SImode.
> 
> This looks to be because....
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00062.html
> 
> 
>> ---
>>  gcc/builtins.c | 13 ++++++++++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
> 
> 0002-Propagate-address-spaces-to-builtins.patch
> 
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 58ea747..361361c 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -5781,14 +5781,21 @@ static rtx
>  get_builtin_sync_mem (tree loc, machine_mode mode)
>  {
>    rtx addr, mem;
> +  int addr_space = TYPE_ADDR_SPACE (POINTER_TYPE_P (TREE_TYPE (loc))
> +                    ? TREE_TYPE (TREE_TYPE (loc))
> +                    : TREE_TYPE (loc));
> +  scalar_int_mode addr_mode = targetm.addr_space.address_mode
> (addr_space);
>  
> ... This now returns Pmode (the default for the hook) for aarch64 ILP32,
> which is always DImode.
> 
> -  addr = expand_expr (loc, NULL_RTX, ptr_mode, EXPAND_SUM);
> 
> Before this patch we used ptr_mode, which does the right thing for
> AArch64 ILP32.
> Do you think we should just be implementing
> targetm.addr_space.address_mode for AArch64 to return SImode for ILP32?
Possibly.   Is there any fallout from making that change?

Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2019-09-03 14:01   ` [PATCH 02/25] Propagate address spaces to builtins Kyrill Tkachov
  2019-09-03 15:00     ` Jeff Law
@ 2019-09-03 15:43     ` Andrew Stubbs
  1 sibling, 0 replies; 187+ messages in thread
From: Andrew Stubbs @ 2019-09-03 15:43 UTC (permalink / raw)
  To: Kyrill Tkachov, gcc-patches, richard.henderson

On 03/09/2019 15:01, Kyrill Tkachov wrote:
> Sorry for responding to this so late. I'm testing a rebased version of 
> Richard's OOL atomic patches [1] and am hitting an ICE building the 
> -mabi=ilp32 libgfortran multilib for aarch64-none-elf:

I thought Andreas already fixed ILP32.

https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01439.html

Andrew

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2019-09-03 15:00     ` Jeff Law
@ 2019-09-04 14:21       ` Kyrill Tkachov
  2019-09-04 15:29         ` Kyrill Tkachov
  0 siblings, 1 reply; 187+ messages in thread
From: Kyrill Tkachov @ 2019-09-04 14:21 UTC (permalink / raw)
  To: Jeff Law, ams, gcc-patches, richard.henderson


On 9/3/19 4:00 PM, Jeff Law wrote:
> On 9/3/19 8:01 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> On 9/5/18 12:48 PM, ams@codesourcery.com wrote:
>>> At present, pointers passed to builtin functions, including atomic
>>> operators,
>>> are stripped of their address space properties.  This doesn't seem to be
>>> deliberate, it just omits to copy them.
>>>
>>> Not only that, but it forces pointer sizes to Pmode, which isn't
>>> appropriate
>>> for all address spaces.
>>>
>>> This patch attempts to correct both issues.  It works for GCN atomics and
>>> GCN OpenACC gang-private variables.
>>>
>>> 2018-09-05  Andrew Stubbs  <ams@codesourcery.com>
>>>              Julian Brown  <julian@codesourcery.com>
>>>
>>>          gcc/
>>>          * builtins.c (get_builtin_sync_mem): Handle address spaces.
>>
>> Sorry for responding to this so late. I'm testing a rebased version of
>> Richard's OOL atomic patches [1] and am hitting an ICE building the
>> -mabi=ilp32 libgfortran multilib for aarch64-none-elf:
>>
>> 0x7284db emit_library_call_value_1(int, rtx_def*, rtx_def*,
>> libcall_type, machine_mode, int, std::pair<rtx_def*, machine_mode>*)
>>          $SRC/gcc/calls.c:4915
>> 0x1037817 emit_library_call_value(rtx_def*, rtx_def*, libcall_type,
>> machine_mode, rtx_def*, machine_mode, rtx_def*, machine_mode, rtx_def*,
>> machine_mode)
>>          $SRC/gcc/rtl.h:4240
>> 0x1037817 aarch64_expand_compare_and_swap(rtx_def**)
>>          $SRC/gcc/config/aarch64/aarch64.c:16981
>> 0x1353a43 gen_atomic_compare_and_swapsi(rtx_def*, rtx_def*, rtx_def*,
>> rtx_def*, rtx_def*, rtx_def*, rtx_def*, rtx_def*)
>>          $SRC/gcc/config/aarch64/atomics.md:34
>> 0xb1f9f1 insn_gen_fn::operator()(rtx_def*, rtx_def*, rtx_def*, rtx_def*,
>> rtx_def*, rtx_def*, rtx_def*, rtx_def*) const
>>          $SRC/gcc/recog.h:324
>> 0xb1f9f1 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
>>          $SRC/gcc/optabs.c:7443
>> 0xb1fa78 maybe_expand_insn(insn_code, unsigned int, expand_operand*)
>>          $SRC/gcc/optabs.c:7459
>> 0xb21024 expand_atomic_compare_and_swap(rtx_def**, rtx_def**, rtx_def*,
>> rtx_def*, rtx_def*, bool, memmodel, memmodel)
>>          $SRC/gcc/optabs.c:6448
>> 0x709bd3 expand_builtin_atomic_compare_exchange
>>          $SRC/gcc/builtins.c:6379
>> 0x71a4e9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
>>          $SRC/gcc/builtins.c:8147
>> 0x88b746 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>          $SRC/gcc/expr.c:11052
>> 0x88cce6 expand_expr_real(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>          $SRC/gcc/expr.c:8289
>> 0x74cb47 expand_expr
>>          $SRC/gcc/expr.h:281
>> 0x74cb47 expand_call_stmt
>>          $SRC/gcc/cfgexpand.c:2731
>> 0x74cb47 expand_gimple_stmt_1
>>          $SRC/gcc/cfgexpand.c:3710
>> 0x74cb47 expand_gimple_stmt
>>          $SRC/gcc/cfgexpand.c:3875
>> 0x75439b expand_gimple_basic_block
>>          $SRC/gcc/cfgexpand.c:5915
>> 0x7563ab execute
>>          $SRC/gcc/cfgexpand.c:6538
>> Please submit a full bug report,
>> with preprocessed source if appropriate.
>> Please include the complete backtrace with any bug report.
>> See <https://gcc.gnu.org/bugs/> for instructions.
>>
>> A MEM rtx now uses a DImode address where for ILP32 we expect SImode.
>>
>> This looks to be because....
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00062.html
>>
>>
>>> ---
>>>   gcc/builtins.c | 13 ++++++++++---
>>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>> 0002-Propagate-address-spaces-to-builtins.patch
>>
>> diff --git a/gcc/builtins.c b/gcc/builtins.c
>> index 58ea747..361361c 100644
>> --- a/gcc/builtins.c
>> +++ b/gcc/builtins.c
>> @@ -5781,14 +5781,21 @@ static rtx
>>   get_builtin_sync_mem (tree loc, machine_mode mode)
>>   {
>>     rtx addr, mem;
>> +  int addr_space = TYPE_ADDR_SPACE (POINTER_TYPE_P (TREE_TYPE (loc))
>> +                    ? TREE_TYPE (TREE_TYPE (loc))
>> +                    : TREE_TYPE (loc));
>> +  scalar_int_mode addr_mode = targetm.addr_space.address_mode
>> (addr_space);
>>   
>> ... This now returns Pmode (the default for the hook) for aarch64 ILP32,
>> which is always DImode.
>>
>> -  addr = expand_expr (loc, NULL_RTX, ptr_mode, EXPAND_SUM);
>>
>> Before this patch we used ptr_mode, which does the right thing for
>> AArch64 ILP32.
>> Do you think we should just be implementing
>> targetm.addr_space.address_mode for AArch64 to return SImode for ILP32?
> Possibly.   Is there any fallout from making that change?

Unfortunately there are some ICEs when building libgcc, with POST_INC 
arguments being output :(

I'll need to dig further.

Thanks,

Kyrill


>
> Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 02/25] Propagate address spaces to builtins.
  2019-09-04 14:21       ` Kyrill Tkachov
@ 2019-09-04 15:29         ` Kyrill Tkachov
  0 siblings, 0 replies; 187+ messages in thread
From: Kyrill Tkachov @ 2019-09-04 15:29 UTC (permalink / raw)
  To: Jeff Law, ams, gcc-patches, richard.henderson


On 9/4/19 3:21 PM, Kyrill Tkachov wrote:
>
> On 9/3/19 4:00 PM, Jeff Law wrote:
> > On 9/3/19 8:01 AM, Kyrill Tkachov wrote:
> >> Hi all,
> >>
> >> On 9/5/18 12:48 PM, ams@codesourcery.com wrote:
> >>> At present, pointers passed to builtin functions, including atomic
> >>> operators,
> >> are stripped of their address space properties.  This doesn't seem
> >> to be deliberate, it just omits to copy them.
> >>>
> >>> Not only that, but it forces pointer sizes to Pmode, which isn't
> >>> appropriate
> >>> for all address spaces.
> >>>
> >>> This patch attempts to correct both issues.  It works for GCN
> >>> atomics and GCN OpenACC gang-private variables.
> >>>
> >>> 2018-09-05  Andrew Stubbs <ams@codesourcery.com>
> >>>              Julian Brown <julian@codesourcery.com>
> >>>
> >>>          gcc/
> >>>          * builtins.c (get_builtin_sync_mem): Handle address spaces.
> >>
> >> Sorry for responding to this so late. I'm testing a rebased version of
> >> Richard's OOL atomic patches [1] and am hitting an ICE building the
> >> -mabi=ilp32 libgfortran multilib for aarch64-none-elf:
> >>
> >> 0x7284db emit_library_call_value_1(int, rtx_def*, rtx_def*,
> >> libcall_type, machine_mode, int, std::pair<rtx_def*, machine_mode>*)
> >>          $SRC/gcc/calls.c:4915
> >> 0x1037817 emit_library_call_value(rtx_def*, rtx_def*, libcall_type,
> >> machine_mode, rtx_def*, machine_mode, rtx_def*, machine_mode, rtx_def*,
> >> machine_mode)
> >>          $SRC/gcc/rtl.h:4240
> >> 0x1037817 aarch64_expand_compare_and_swap(rtx_def**)
> >>          $SRC/gcc/config/aarch64/aarch64.c:16981
> >> 0x1353a43 gen_atomic_compare_and_swapsi(rtx_def*, rtx_def*, rtx_def*,
> >> rtx_def*, rtx_def*, rtx_def*, rtx_def*, rtx_def*)
> >>          $SRC/gcc/config/aarch64/atomics.md:34
> >> 0xb1f9f1 insn_gen_fn::operator()(rtx_def*, rtx_def*, rtx_def*, rtx_def*,
> >> rtx_def*, rtx_def*, rtx_def*, rtx_def*) const
> >>          $SRC/gcc/recog.h:324
> >> 0xb1f9f1 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
> >>          $SRC/gcc/optabs.c:7443
> >> 0xb1fa78 maybe_expand_insn(insn_code, unsigned int, expand_operand*)
> >>          $SRC/gcc/optabs.c:7459
> >> 0xb21024 expand_atomic_compare_and_swap(rtx_def**, rtx_def**, rtx_def*,
> >> rtx_def*, rtx_def*, bool, memmodel, memmodel)
> >>          $SRC/gcc/optabs.c:6448
> >> 0x709bd3 expand_builtin_atomic_compare_exchange
> >>          $SRC/gcc/builtins.c:6379
> >> 0x71a4e9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
> >>          $SRC/gcc/builtins.c:8147
> >> 0x88b746 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >> expand_modifier, rtx_def**, bool)
> >>          $SRC/gcc/expr.c:11052
> >> 0x88cce6 expand_expr_real(tree_node*, rtx_def*, machine_mode,
> >> expand_modifier, rtx_def**, bool)
> >>          $SRC/gcc/expr.c:8289
> >> 0x74cb47 expand_expr
> >>          $SRC/gcc/expr.h:281
> >> 0x74cb47 expand_call_stmt
> >>          $SRC/gcc/cfgexpand.c:2731
> >> 0x74cb47 expand_gimple_stmt_1
> >>          $SRC/gcc/cfgexpand.c:3710
> >> 0x74cb47 expand_gimple_stmt
> >>          $SRC/gcc/cfgexpand.c:3875
> >> 0x75439b expand_gimple_basic_block
> >>          $SRC/gcc/cfgexpand.c:5915
> >> 0x7563ab execute
> >>          $SRC/gcc/cfgexpand.c:6538
> >> Please submit a full bug report,
> >> with preprocessed source if appropriate.
> >> Please include the complete backtrace with any bug report.
> >> See <https://gcc.gnu.org/bugs/> for instructions.
> >>
> >> A MEM rtx now uses a DImode address where for ILP32 we expect SImode.
> >>
> >> This looks to be because....
> >>
> >> [1] https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00062.html
> >>
> >>
> >>> ---
> >>>   gcc/builtins.c | 13 ++++++++++---
> >>>   1 file changed, 10 insertions(+), 3 deletions(-)
> >>>
> >> 0002-Propagate-address-spaces-to-builtins.patch
> >>
> >> diff --git a/gcc/builtins.c b/gcc/builtins.c
> >> index 58ea747..361361c 100644
> >> --- a/gcc/builtins.c
> >> +++ b/gcc/builtins.c
> >> @@ -5781,14 +5781,21 @@ static rtx
> >>   get_builtin_sync_mem (tree loc, machine_mode mode)
> >>   {
> >>     rtx addr, mem;
> >> +  int addr_space = TYPE_ADDR_SPACE (POINTER_TYPE_P (TREE_TYPE (loc))
> >> +                    ? TREE_TYPE (TREE_TYPE (loc))
> >> +                    : TREE_TYPE (loc));
> >> +  scalar_int_mode addr_mode = targetm.addr_space.address_mode (addr_space);
> >>
> >> ... This now returns Pmode (the default for the hook) for aarch64 ILP32,
> >> which is always DImode.
> >>
> >> -  addr = expand_expr (loc, NULL_RTX, ptr_mode, EXPAND_SUM);
> >>
> >> Before this patch we used ptr_mode, which does the right thing for
> >> AArch64 ILP32.
> >> Do you think we should just be implementing
> >> targetm.addr_space.address_mode for AArch64 to return SImode for ILP32?
> > Possibly.   Is there any fallout from making that change?
>
> Unfortunately, that change causes some ICEs when building libgcc,
> involving POST_INC arguments in the output :(
>
> I'll need to dig further.

Adding a convert_memory_address to ptr_mode before each call to
emit_library_call_value in the OOL atomics code does fix the ICEs.
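
Concretely, each call site gets the conversion just before the library
call, along these lines (a minimal sketch of the pattern, assuming
"addr" holds the Pmode address operand; the real patch touches several
call sites in aarch64.c):

    /* The libcall interface expects ptr_mode (SImode for ILP32), but
       the MEM address may be Pmode/DImode, so narrow it first.  */
    addr = convert_memory_address (ptr_mode, addr);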

Let's see what testing shows.

Kyrill



>
> Thanks,
>
> Kyrill
>
>
> >
> > Jeff

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 24/25] Ignore LLVM's blank lines.
  2018-09-14 16:19   ` Jeff Law
@ 2020-03-23 15:29     ` Thomas Schwinge
  2020-03-24 21:05       ` Thomas Schwinge
  0 siblings, 1 reply; 187+ messages in thread
From: Thomas Schwinge @ 2020-03-23 15:29 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: Jeff Law, gcc-patches, Jakub Jelinek

Hi!

Now that I'm looking into enabling AMD GCN offloading in my builds, I
have a few concerns here.  ;-)

On 2018-09-14T10:18:12-0600, Jeff Law <law@redhat.com> wrote:
> On 9/5/18 5:52 AM, ams@codesourcery.com wrote:
>>
>> The GCN toolchain must use the LLVM assembler and linker because there's no
>> binutils port.  The LLVM tools do not have the same diagnostic style as
>> binutils

For reference:
'libgomp.c++/../libgomp.c-c++-common/function-not-offloaded.c', for
example:

    ld: error: undefined symbol: foo()
    >>> referenced by /tmp/ccNzknBD.o:(main._omp_fn.0)
    >>> referenced by /tmp/ccNzknBD.o:(main._omp_fn.0)

    ld: error: undefined symbol: __gxx_personality_v0
    >>> referenced by /tmp/ccNzknBD.o:(.data+0x13)
    collect2: error: ld returned 1 exit status
    mkoffload: fatal error: [...]/build-gcc/./gcc/x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc returned 1 exit status

Note the blank line between the two "diagnostic blocks".

>> so the "blank line(s) in output" tests are inappropriate (and very
>> noisy).

>>      gcc/testsuite/

>>      * lib/gcc-dg.exp (gcc-dg-prune): Ignore blank lines from the LLVM
>>      linker.
>>      * lib/target-supports.exp (check_effective_target_llvm_binutils): New.

> This is fine.  It's a NOP for other targets, so feel free to commit when
> it's convenient for you.

See below, is it really "a NOP for other targets"?

| --- a/gcc/testsuite/lib/gcc-dg.exp
| +++ b/gcc/testsuite/lib/gcc-dg.exp
| @@ -361,7 +361,7 @@ proc gcc-dg-prune { system text } {
|
|      # Complain about blank lines in the output (PR other/69006)
|      global allow_blank_lines
| -    if { !$allow_blank_lines } {
| +    if { !$allow_blank_lines && ![check_effective_target_llvm_binutils]} {
|       set num_blank_lines [llength [regexp -all -inline "\n\n" $text]]
|       if { $num_blank_lines } {
|           global testname_with_flags

(That got re-worked a bit, per <https://gcc.gnu.org/PR88920>.)

| --- a/gcc/testsuite/lib/target-supports.exp
| +++ b/gcc/testsuite/lib/target-supports.exp
| @@ -9129,6 +9129,14 @@ proc check_effective_target_offload_hsa { } {
|      } "-foffload=hsa" ]
|  }
|
| +# Return 1 if the compiler has been configured with hsa offloading.

(Is this conceptually really appropriate to have here in
'gcc/testsuite/', given that all 'mkoffload'-based offloading testing
happens in libgomp only?)  (And, 'hsa' copy'n'pasto should be 'amdgcn'?)

| +
| +proc check_effective_target_offload_gcn { } {
| +    return [check_no_compiler_messages offload_gcn assembly {
| +     int main () {return 0;}
| +    } "-foffload=amdgcn-unknown-amdhsa" ]
| +}

This is too specific: "amdgcn-unknown-amdhsa" is often spelled just
"amdgcn-amdhsa", for example.

Our current '-foffload' syntax is not amenable to such variations; we
really need to re-work that.  (That's also one of the reasons why
<https://gcc.gnu.org/PR67300> "-foffload* undocumented" is still open,
and hasn't seen any further work: the '-foffload' as we've currently got
it implemented is somewhat useful, yes, but needs to be improved to be
really useful for users -- as well as for ourselves, as we're seeing
here.)  For example, consider you'd like to do offloading compilation to
include code for two different AMD GCN ISAs (fat binaries).  Or, once
that is supported for OpenACC, "offloading" to several different
multicore CPU ISAs/variations.  Can't specify such things given the
current syntax; we need some kind of abstraction from offload target.
(But that's a separate discussion, of course.)

Anyway, per the current code here, if instead of amdgcn-unknown-amdhsa I
configure GCC for amdgcn-amdhsa offload target, this test will not work.

| +# Return 1 if this target uses an LLVM assembler and/or linker
| +proc check_effective_target_llvm_binutils { } {
| +    return [expr { [istarget amdgcn*-*-*]
| +                || [check_effective_target_offload_gcn] } ]
| +}

Unless I'm understanding something wrong, this (second condition here)
means that all the checking added for <https://gcc.gnu.org/PR69006>
"Extraneous newline emitted between error messages in GCC 6" will be void
as soon as GCC is configured for AMD GCN offloading.  (This
effective-target here doesn't just apply to AMD GCN offloading
compilation, but to *all* standard GCC testsuite checking, doesn't it?)
That seems problematic to me: conceptually, but also in practice, given
that more and more users (for example, major GNU/Linux distributions) are
enabling GCC offloading compilation.


So here is a different proposal.  The problem we're having is that the
AMD GCN 'as' (that is, 'llvm-mc') prints "unsuitable" diagnostics (as
quoted at the top of my email).

How about having a simple wrapper around it, to post-process its 'stderr'
to remove any blank lines between "diagnostic blocks"?  Then we could
remove this fragile 'check_effective_target_llvm_binutils' etc.?
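
(I'm thinking of something minimal along these lines -- an untested
sketch, ignoring tool lookup and color handling for now:

    #!/usr/bin/env bash
    # Hypothetical wrapper: run the real tool, dropping blank lines
    # from its stderr.
    exec "$@" 2> >(grep -v '^$' >&2)

-- though a robust version will need to be more careful about the
ordering and flushing of the filtered output.)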

I shall offer to implement the simple shell script, and suppose this
could live in 'gcc/config/gcn/'?  ("Just" need to figure out how to
integrate that into the GCC build process, top-level build system.)


Grüße
 Thomas
-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 24/25] Ignore LLVM's blank lines.
  2020-03-23 15:29     ` Thomas Schwinge
@ 2020-03-24 21:05       ` Thomas Schwinge
  0 siblings, 0 replies; 187+ messages in thread
From: Thomas Schwinge @ 2020-03-24 21:05 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Jeff Law, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2415 bytes --]

Hi!

On 2020-03-23T16:29:40+0100, I wrote:
> On 2018-09-14T10:18:12-0600, Jeff Law <law@redhat.com> wrote:
>> On 9/5/18 5:52 AM, ams@codesourcery.com wrote:
>>>
>>> The GCN toolchain must use the LLVM assembler and linker because there's no
>>> binutils port.  The LLVM tools do not have the same diagnostic style as
>>> binutils
>
> For reference:
> 'libgomp.c++/../libgomp.c-c++-common/function-not-offloaded.c', for
> example:
>
>     ld: error: undefined symbol: foo()
>     >>> referenced by /tmp/ccNzknBD.o:(main._omp_fn.0)
>     >>> referenced by /tmp/ccNzknBD.o:(main._omp_fn.0)
>
>     ld: error: undefined symbol: __gxx_personality_v0
>     >>> referenced by /tmp/ccNzknBD.o:(.data+0x13)
>     collect2: error: ld returned 1 exit status
>     mkoffload: fatal error: [...]/build-gcc/./gcc/x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc returned 1 exit status
>
> Note the blank line between the two "diagnostic blocks".
>
>>> so the "blank line(s) in output" tests are inappropriate (and very
>>> noisy).

> So here is a different proposal.  The problem we're having is that the
> AMD GCN 'as' (that is, 'llvm-mc') prints "unsuitable" diagnostics (as
> quoted at the top of my email).

(No idea where I got the idea that 'as' prints 'ld' error messages,
as displayed above...)  ;-) (But supposedly that problem applies to both,
or even to all LLVM tools generally?)

> How about having a simple wrapper around it, to post-process its 'stderr'
> to remove any blank linkes between "diagnostic blocks"?  Then we could
> remove this fragile 'check_effective_target_llvm_binutils' etc.?
>
> I shall offer to implement the simple shell script, and suppose this
> could live in 'gcc/config/gcn/'?

I have implemented that, and it appears to generally work, but of
course...

> ("Just" need to figure out how to
> integrate that into the GCC build process, top-level build system.)

... this was the non-trivial bit, and may need some further thought --
which I can't allocate time for right now, so I'll postpone this for
later.  I'm attaching my "[WIP] 'llvm-tools-wrapper' [PR88920]" patch, in
case somebody would like to have a first look.


Grüße
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-WIP-llvm-tools-wrapper-PR88920.patch --]
[-- Type: text/x-diff, Size: 15086 bytes --]

From 31c8828d37bed37b4ee4eb3bbefb8eb1db46e0a1 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 24 Mar 2020 21:59:04 +0100
Subject: [PATCH] [WIP] 'llvm-tools-wrapper' [PR88920]

	PR testsuite/88920
---
 gcc/Makefile.in                               |   9 ++
 gcc/config.in                                 |   6 +
 gcc/configure                                 |  17 +++
 gcc/configure.ac                              |  12 ++
 gcc/doc/sourcebuild.texi                      |   6 -
 gcc/gcc.c                                     | 116 +++++++++++++++---
 gcc/llvm-tools-wrapper.in                     |  35 ++++++
 gcc/testsuite/lib/gcc-dg.exp                  |   4 -
 gcc/testsuite/lib/target-supports.exp         |  15 ---
 .../function-not-offloaded.c                  |   1 -
 10 files changed, 175 insertions(+), 46 deletions(-)
 create mode 100755 gcc/llvm-tools-wrapper.in

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index fa9923bb2703..4d13707453b2 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1974,6 +1974,7 @@ start.encap: native xgcc$(exeext) cpp$(exeext) specs \
 rest.encap: lang.rest.encap
 # This is what is made with the host's compiler
 # whether making a cross compiler or not.
+#TODO Do we need 'llvm-tools-wrapper' here?
 native: config.status auto-host.h build-@POSUB@ $(LANGUAGES) \
 	$(EXTRA_PROGRAMS) $(COLLECT2) lto-wrapper$(exeext) \
 	gcc-ar$(exeext) gcc-nm$(exeext) gcc-ranlib$(exeext)
@@ -3525,6 +3526,10 @@ ifeq ($(enable_plugin),yes)
 install: install-plugin
 endif
 
+#TODO ifeq (TODO)
+install: install-llvm-tools-wrapper
+#TODO endif
+
 install-strip: override INSTALL_PROGRAM = $(INSTALL_STRIP_PROGRAM)
 ifneq ($(STRIP),)
 install-strip: STRIPPROG = $(STRIP)
@@ -3920,6 +3925,10 @@ install-gcc-ar: installdirs gcc-ar$(exeext) gcc-nm$(exeext) gcc-ranlib$(exeext)
 	  done; \
 	fi
 
+# Install llvm-tools-wrapper.
+install-llvm-tools-wrapper: llvm-tools-wrapper$(exeext)
+	$(INSTALL_PROGRAM) llvm-tools-wrapper$(exeext) $(DESTDIR)$(libexecsubdir)/llvm-tools-wrapper$(exeext)
+
 # Cancel installation by deleting the installed files.
 uninstall: lang.uninstall
 	-rm -rf $(DESTDIR)$(libsubdir)
diff --git a/gcc/config.in b/gcc/config.in
index 01fb18dbbb5a..ab51b9aef898 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -2042,6 +2042,12 @@
 #endif
 
 
+/* Define if using LLVM tools. */
+#ifndef USED_FOR_TARGET
+#undef LLVM_TOOLS_WRAPPER
+#endif
+
+
 /* Define to the name of the LTO plugin DSO that must be passed to the
    linker's -plugin=LIB option. */
 #ifndef USED_FOR_TARGET
diff --git a/gcc/configure b/gcc/configure
index 5381e107bce7..06f46c4bd6b0 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -22994,6 +22994,21 @@ else
 $as_echo "$gcc_cv_otool" >&6; }
 fi
 
+# Do we need the llvm-tools-wrapper?
+#TODO Make this a feature test, or is that good enough?
+case $target in
+  amdgcn*)
+
+$as_echo "#define LLVM_TOOLS_WRAPPER 1" >>confdefs.h
+
+    #TODO Does this need any 'Makefile.in' support for stating regeneration dependencies?
+    #TODO This is modelled after the 'exec-tool.in' stuff.
+    ac_config_files="$ac_config_files llvm-tools-wrapper:llvm-tools-wrapper.in"
+
+    ;;
+esac
+
+
 # Figure out what assembler alignment features are present.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler flags" >&5
 $as_echo_n "checking assembler flags... " >&6; }
@@ -31460,6 +31475,7 @@ do
     "as") CONFIG_FILES="$CONFIG_FILES as:exec-tool.in" ;;
     "collect-ld") CONFIG_FILES="$CONFIG_FILES collect-ld:exec-tool.in" ;;
     "nm") CONFIG_FILES="$CONFIG_FILES nm:exec-tool.in" ;;
+    "llvm-tools-wrapper") CONFIG_FILES="$CONFIG_FILES llvm-tools-wrapper:llvm-tools-wrapper.in" ;;
     "clearcap.map") CONFIG_LINKS="$CONFIG_LINKS clearcap.map:${srcdir}/config/$clearcap_map" ;;
     "$all_outputs") CONFIG_FILES="$CONFIG_FILES $all_outputs" ;;
     "default") CONFIG_COMMANDS="$CONFIG_COMMANDS default" ;;
@@ -32094,6 +32110,7 @@ $as_echo "$as_me: executing $ac_file commands" >&6;}
     "as":F) chmod +x as ;;
     "collect-ld":F) chmod +x collect-ld ;;
     "nm":F) chmod +x nm ;;
+    "llvm-tools-wrapper":F) chmod +x llvm-tools-wrapper ;;
     "default":C)
 case ${CONFIG_HEADERS} in
   *auto-host.h:config.in*)
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 0d6230e0ca1b..0f3fcfefd4fc 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -2685,6 +2685,18 @@ else
 	AC_MSG_RESULT($gcc_cv_otool)
 fi
 
+# Do we need the llvm-tools-wrapper?
+#TODO Make this a feature test, or is that good enough?
+case $target in
+  amdgcn*)
+    AC_DEFINE(LLVM_TOOLS_WRAPPER, 1, [Define if using LLVM tools.])
+    #TODO Does this need any 'Makefile.in' support for stating regeneration dependencies?
+    #TODO This is modelled after the 'exec-tool.in' stuff.
+    AC_CONFIG_FILES(llvm-tools-wrapper:llvm-tools-wrapper.in, [chmod +x llvm-tools-wrapper])
+    ;;
+esac
+
+
 # Figure out what assembler alignment features are present.
 gcc_GAS_CHECK_FEATURE([.balign and .p2align], gcc_cv_as_balign_and_p2align,
  [2,6,0],,
diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index eef1432147ce..d6057656c6ef 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -2333,9 +2333,6 @@ Target uses GNU @command{ld}.
 Target keeps null pointer checks, either due to the use of
 @option{-fno-delete-null-pointer-checks} or hardwired into the target.
 
-@item llvm_binutils
-Target is using an LLVM assembler and/or linker, instead of GNU Binutils.
-
 @item lto
 Compiler has been configured to support link-time optimization (LTO).
 
@@ -2363,9 +2360,6 @@ Target supports the @code{noinit} variable attribute.
 @item nonpic
 Target does not generate PIC by default.
 
-@item offload_gcn
-Target has been configured for OpenACC/OpenMP offloading on AMD GCN.
-
 @item pie_enabled
 Target generates PIE by default.
 
diff --git a/gcc/gcc.c b/gcc/gcc.c
index 9f790db0daf4..eb0fb836ffc0 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -3038,11 +3038,18 @@ execute (void)
 
   gcc_assert (!processing_spec_function);
 
+  /* TODO Why is this wrapping here done "early", instead of "late", as done
+     for the 'valgrind' wrapping?  */
   if (wrapper_string)
     {
       string = find_a_file (&exec_prefixes,
 			    argbuf[0], X_OK, false);
       if (string)
+	/* This overwrites 'argbuf[0]', thus later 'commands[0].prog' is set to
+	   the resolved 'string' instead of the original 'argbuf[0]'.  This
+	   means that any special handling for 'argbuf[0]' can no longer be
+	   expected to work (such as when 'find_a_file' compares this to 'as'
+	   or 'ld').  */
 	argbuf[0] = string;
       insert_wrapper (wrapper_string);
     }
@@ -3064,6 +3071,7 @@ execute (void)
   commands[0].prog = argbuf[0]; /* first command.  */
   commands[0].argv = argbuf.address ();
 
+  /* TODO For 'wrapper_string', this has already been done above.  */
   if (!wrapper_string)
     {
       string = find_a_file (&exec_prefixes, commands[0].prog, X_OK, false);
@@ -3164,32 +3172,100 @@ execute (void)
 #endif /* DEBUG */
     }
 
-#ifdef ENABLE_VALGRIND_CHECKING
-  /* Run the each command through valgrind.  To simplify prepending the
-     path to valgrind and the option "-q" (for quiet operation unless
-     something triggers), we allocate a separate argv array.  */
-
+  /* Possibly run each command through a wrapper.  */
+  //TODO Why is this done after '-v' handling?
   for (i = 0; i < n_commands; i++)
     {
-      const char **argv;
-      int argc;
-      int j;
-
-      for (argc = 0; commands[i].argv[argc] != NULL; argc++)
-	;
+      size_t argc_wrap_llvm_tools_wrapper = 0;
+#ifdef LLVM_TOOLS_WRAPPER //TODO Conditional whether assembler (?TODO), linker (even separately?) are LLVM tools.
+      //TODO For offloading compilation (I haven't tried native yet), something isn't working: despite passing in '-fdiagnostics-color=never', the 'lld' wrapper invocation is still done with '-fdiagnostics-color=always' if printing to a terminal -- but does turn into '-fdiagnostics-color=never' if not printing to a terminal.
+      //TODO Does that mean that 'diagnostic_color_init' is only doing the 'DIAGNOSTICS_COLOR_DEFAULT' thing, but not paying attention to the command-line argument?
+      //TODO Per PR69707, that is supposed to be working?
+      //TODO Is that maybe a problem in 'mkoffload'?  (..., and if yes, just amdgcn, or all?)
+      const char *llvm_tools_wrapper_color_arg = NULL;
+#if 0 //TODO Do this for 'as', too?
+      if (!strcmp (commands[i].prog, "as"))
+	{
+	  /* That's 'llvm-mc'.  */
+	  argc_wrap_llvm_tools_wrapper = 1;
+	  if (pp_show_color (global_dc->printer))
+	    {
+	      argc_wrap_llvm_tools_wrapper += 1;
+	      llvm_tools_wrapper_color_arg = "--color";
+	    }
+	  else
+	    ; /* There doesn't seem to be a command-line flag to force-disable
+		 this, so let's hope the default does the right thing.  */
+	}
+      else
+#endif
+	if (!strcmp (commands[i].prog, linker_name_spec))
+        {
+	  /* That's 'lld'.  */
+	  argc_wrap_llvm_tools_wrapper = 1;
+	  if (pp_show_color (global_dc->printer))
+	    {
+	      argc_wrap_llvm_tools_wrapper += 1;
+	      llvm_tools_wrapper_color_arg = "--color-diagnostics=always";
+	    }
+	  else
+	    {
+	      argc_wrap_llvm_tools_wrapper += 1;
+	      llvm_tools_wrapper_color_arg = "--color-diagnostics=never";
+	    }
+	}
+#endif /* LLVM_TOOLS_WRAPPER */
+    size_t argc_wrap_valgrind = 0;
+#ifdef ENABLE_VALGRIND_CHECKING
+      argc_wrap_valgrind = 2;
+#endif
+      size_t argc_extra = argc_wrap_llvm_tools_wrapper + argc_wrap_valgrind;
+      if (argc_extra > 0)
+	{
+	  /* One extra for the terminator.  */
+	  argc_extra += 1;
 
-      argv = XALLOCAVEC (const char *, argc + 3);
+	  size_t argc;
+	  for (argc = 0; commands[i].argv[argc] != NULL; argc++)
+	    ;
 
-      argv[0] = VALGRIND_PATH;
-      argv[1] = "-q";
-      for (j = 2; j < argc + 2; j++)
-	argv[j] = commands[i].argv[j - 2];
-      argv[j] = NULL;
+	  const char **argv = XALLOCAVEC (const char *, argc_extra + argc);
+	  //TODO Where does this get deallocated?
 
-      commands[i].argv = argv;
-      commands[i].prog = argv[0];
-    }
+	  size_t j = 0;
+	  if (argc_wrap_llvm_tools_wrapper)
+	    {
+	      //TODO This uses 'find_a_file' to locate this in the build directory.  This supposedly should also work for the install tree.  'find_a_file' should be doing/supporting all that; this is modelled similar to 'lto-wrapper'.
+	      argv[j++] = find_a_file (&exec_prefixes, "llvm-tools-wrapper", X_OK, false);
+	    }
+	  if (argc_wrap_valgrind)
+	    {
+#ifdef ENABLE_VALGRIND_CHECKING
+	      /* Run the each command through valgrind.  */
+	      argv[j++] = VALGRIND_PATH;
+	      /* Request quiet operation unless something triggers.  */
+	      argv[j++] = "-q";
+#else
+	      gcc_unreachable ();
 #endif
+	    }
+	  /* Now the actual executable.  */
+	  argv[j++] = commands[i].argv[0];
+#ifdef LLVM_TOOLS_WRAPPER
+	  /* Now any LLVM tools color arguments.  */
+	  if (llvm_tools_wrapper_color_arg)
+	    argv[j++] = llvm_tools_wrapper_color_arg;
+#endif /* LLVM_TOOLS_WRAPPER */
+	  /* Now all other arguments.  */
+	  for (size_t arg = 1; arg < argc; ++arg)
+	    argv[j++] = commands[i].argv[arg];
+	  argv[j++] = NULL;
+	  gcc_checking_assert (j == argc + argc_extra);
+
+	  commands[i].argv = argv;
+	  commands[i].prog = argv[0];
+	}
+    }
 
   /* Run each piped subprocess.  */
 
diff --git a/gcc/llvm-tools-wrapper.in b/gcc/llvm-tools-wrapper.in
new file mode 100755
index 000000000000..a9233209082a
--- /dev/null
+++ b/gcc/llvm-tools-wrapper.in
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+# This is a simple wrapper that invokes '$@', and filters 'stderr' such as to
+# only print non-empty lines.
+
+# Doing this, the invoked program may notice that it's not printing to a
+# terminal, and may change its behavior, say, to disable color output.  If that
+# is a concern, it has to be addressed individually.
+
+# This implementation depends on the GNU Bash, in particular its 'coproc'
+# command.  We expect this wrapper only to be used on systems where such
+# dependencies are not a concern.
+
+set -e
+
+# Set up a coprocess: print non-empty lines to 'stderr'.
+coproc grep --line-buffered . >&2
+
+# Invoke '$@'.
+status=0
+"$@" 2>&"${COPROC[1]}" || status="$?"
+
+# Shut down the coprocess.
+# Apparently, per
+# <https://unix.stackexchange.com/questions/86270/how-do-you-use-the-command-coproc-in-various-shells>,
+# we have to use such an indirection ('coproc_fd_1') to support "bash versions
+# prior to 4.3".
+coproc_fd_1=${COPROC[1]}
+exec {coproc_fd_1}<&-
+# We intentionally do not 'wait' for '$COPROC_PID' here, as that sometimes
+# fails (perhaps if the coprocess has never been active, if there has been no
+# output at all?).
+wait
+
+exit "$status"
diff --git a/gcc/testsuite/lib/gcc-dg.exp b/gcc/testsuite/lib/gcc-dg.exp
index cccd3ce4742c..941fc72b5947 100644
--- a/gcc/testsuite/lib/gcc-dg.exp
+++ b/gcc/testsuite/lib/gcc-dg.exp
@@ -342,10 +342,6 @@ proc gcc-dg-test { prog do_what extra_tool_flags } {
 # for all tests.
 set allow_blank_lines 0
 
-if { [check_effective_target_llvm_binutils] } {
-    set allow_blank_lines 2
-}
-
 # A command for use by testcases to mark themselves as expecting
 # blank lines in the output.
 
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index ca3895c22690..089ecfb68ad4 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9514,14 +9514,6 @@ proc check_effective_target_offload_hsa { } {
     } "-foffload=hsa" ]
 }
 
-# Return 1 if the compiler has been configured with hsa offloading.
-
-proc check_effective_target_offload_gcn { } {
-    return [check_no_compiler_messages offload_gcn assembly {
-	int main () {return 0;}
-    } "-foffload=amdgcn-unknown-amdhsa" ]
-}
-
 # Return 1 if the target support -fprofile-update=atomic
 proc check_effective_target_profile_update_atomic {} {
     return [check_no_compiler_messages profile_update_atomic assembly {
@@ -9963,13 +9955,6 @@ foreach N {df} {
     }]
 }
 
-# Return 1 if this target uses an LLVM assembler and/or linker
-proc check_effective_target_llvm_binutils { } {
-    return [check_cached_effective_target llvm_binutils {
-	      expr { [istarget amdgcn*-*-*]
-		     || [check_effective_target_offload_gcn] }}]
-}
-
 # Return 1 if the compiler supports '-mfentry'.
 
 proc check_effective_target_mfentry { } {
diff --git a/libgomp/testsuite/libgomp.c-c++-common/function-not-offloaded.c b/libgomp/testsuite/libgomp.c-c++-common/function-not-offloaded.c
index f01a64e72c07..9e59ef8864e7 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/function-not-offloaded.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/function-not-offloaded.c
@@ -1,6 +1,5 @@
 /* { dg-do link } */
 /* { dg-excess-errors "unresolved symbol foo, lto1, mkoffload and lto-wrapper fatal errors" { target offload_device_nonshared_as } } */
-/* { dg-allow-blank-lines-in-output 1 } */
 /* { dg-additional-sources "function-not-offloaded-aux.c" } */
 
 #pragma omp declare target
-- 
2.25.1


^ permalink raw reply	[flat|nested] 187+ messages in thread

end of thread, other threads:[~2020-03-24 21:06 UTC | newest]

Thread overview: 187+ messages
2018-09-05 11:49 [PATCH 00/25] AMD GCN Port ams
2018-09-05 11:49 ` [PATCH 02/25] Propagate address spaces to builtins ams
2018-09-20 13:09   ` Richard Biener
2018-09-22 19:22   ` Andreas Schwab
2018-09-24 16:53     ` Andrew Stubbs
2018-09-24 17:40       ` Andreas Schwab
2018-09-25 14:27     ` [patch] Fix AArch64 ILP ICE Andrew Stubbs
2018-09-26  8:55       ` Andreas Schwab
2018-09-26 13:39       ` Richard Biener
2018-09-26 16:17         ` Andrew Stubbs
2019-09-03 14:01   ` [PATCH 02/25] Propagate address spaces to builtins Kyrill Tkachov
2019-09-03 15:00     ` Jeff Law
2019-09-04 14:21       ` Kyrill Tkachov
2019-09-04 15:29         ` Kyrill Tkachov
2019-09-03 15:43     ` Andrew Stubbs
2018-09-05 11:49 ` [PATCH 04/25] SPECIAL_REGNO_P ams
2018-09-05 12:21   ` Joseph Myers
2018-09-11 22:42   ` Jeff Law
2018-09-12 11:30     ` Andrew Stubbs
2018-09-13 10:03       ` Andrew Stubbs
2018-09-13 14:14         ` Andrew Stubbs
2018-09-13 14:39           ` Paul Koning
2018-09-13 14:49             ` Andrew Stubbs
2018-09-13 14:58               ` Paul Koning
2018-09-13 15:22                 ` Andrew Stubbs
2018-09-13 17:13                   ` Paul Koning
2018-09-17 22:59           ` Jeff Law
2018-10-04 19:13         ` Jeff Law
2018-09-12 15:31   ` Richard Henderson
2018-09-12 16:14     ` Andrew Stubbs
2018-09-05 11:49 ` [PATCH 01/25] Handle vectors that don't fit in an integer ams
2018-09-05 11:54   ` Jakub Jelinek
2018-09-14 16:03   ` Richard Sandiford
2018-11-15 17:20     ` Andrew Stubbs
2018-09-05 11:49 ` [PATCH 05/25] Add sorry_at diagnostic function ams
2018-09-05 13:39   ` David Malcolm
2018-09-05 13:41     ` David Malcolm
2018-09-11 10:30       ` Andrew Stubbs
2018-09-05 11:50 ` [PATCH 07/25] [pr82089] Don't sign-extend SFV 1 in BImode ams
2018-09-17  8:46   ` Richard Sandiford
2018-09-26 15:52     ` Andrew Stubbs
2018-09-26 16:49       ` Richard Sandiford
2018-09-27 12:20         ` Andrew Stubbs
2018-09-05 11:50 ` [PATCH 08/25] Fix co-array allocation ams
     [not found]   ` <7f5064c3-afc6-b7b5-cade-f03af5b86331@moene.org>
2018-09-05 18:07     ` Janne Blomqvist
2018-09-19 16:38       ` Andrew Stubbs
2018-09-19 22:27         ` Damian Rouson
2018-09-19 22:55           ` Andrew Stubbs
2018-09-20  1:21             ` Damian Rouson
2018-09-20 20:49           ` Thomas Koenig
2018-09-20 20:59             ` Damian Rouson
2018-09-21  7:38             ` Toon Moene
2018-09-23 11:57               ` Janne Blomqvist
2018-09-21 16:37             ` OpenCoarrays integration with gfortran Jerry DeLisle
2018-09-21 19:37               ` Janne Blomqvist
2018-09-21 19:44               ` Richard Biener
2018-09-21 20:25               ` Damian Rouson
2018-09-22  3:47                 ` Jerry DeLisle
2018-09-23 10:41                   ` Toon Moene
2018-09-23 18:03                     ` Bernhard Reutner-Fischer
2018-09-24 11:14                     ` Alastair McKinstry
2018-09-27 12:51                       ` Richard Biener
2018-09-20 15:59         ` [PATCH 08/25] Fix co-array allocation Janne Blomqvist
2018-09-20 16:37           ` Andrew Stubbs
2018-09-05 11:50 ` [PATCH 06/25] Remove constant vec_select restriction ams
2018-09-11 22:44   ` Jeff Law
2018-09-05 11:50 ` [PATCH 09/25] Elide repeated RTL elements ams
2018-09-11 22:46   ` Jeff Law
2018-09-12  8:47     ` Andrew Stubbs
2018-09-12 15:14       ` Jeff Law
2018-09-19 17:25     ` Andrew Stubbs
2018-09-20 11:42       ` Andrew Stubbs
2018-09-26 16:23         ` Andrew Stubbs
2018-10-04 18:24         ` Jeff Law
2018-10-11 14:28           ` Andrew Stubbs
2018-09-05 11:50 ` [PATCH 03/25] Improve TARGET_MANGLE_DECL_ASSEMBLER_NAME ams
2018-09-11 22:56   ` Jeff Law
2018-09-12 14:43     ` Richard Biener
2018-09-12 15:07       ` Jeff Law
2018-09-12 15:16         ` Richard Biener
2018-09-12 16:32           ` Andrew Stubbs
2018-09-12 17:39             ` Julian Brown
2018-09-15  6:01               ` Julian Brown
2018-09-19 15:23                 ` Julian Brown
2018-09-20 12:36                   ` Richard Biener
2018-09-05 11:50 ` [PATCH 10/25] Convert BImode vectors ams
2018-09-05 11:56   ` Jakub Jelinek
2018-09-05 12:05   ` Richard Biener
2018-09-05 12:40     ` Andrew Stubbs
2018-09-05 12:44       ` Richard Biener
2018-09-11 14:36         ` Andrew Stubbs
2018-09-12 14:37           ` Richard Biener
2018-09-17  8:51   ` Richard Sandiford
2018-09-05 11:50 ` [PATCH 12/25] Make default_static_chain return NULL in non-static functions ams
2018-09-17 18:55   ` Richard Sandiford
2018-09-28 14:23     ` Andrew Stubbs
2018-09-05 11:51 ` [PATCH 13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE ams
2018-09-17 19:31   ` Richard Sandiford
2018-09-18  9:02     ` Andrew Stubbs
2018-09-18 11:30       ` Richard Sandiford
2018-09-18 20:27         ` Andrew Stubbs
2018-09-19 13:46           ` Richard Biener
2018-09-28 12:48             ` Andrew Stubbs
2018-10-01  8:05               ` Richard Biener
2018-09-05 11:51 ` [PATCH 14/25] Disable inefficient vectorization of elementwise loads/stores ams
2018-09-17  9:16   ` Richard Sandiford
2018-09-17  9:54     ` Andrew Stubbs
2018-09-17 12:40       ` Richard Sandiford
2018-09-17 12:46         ` Andrew Stubbs
2018-09-20 13:01           ` Richard Biener
2018-09-20 13:51             ` Richard Sandiford
2018-09-20 14:14               ` Richard Biener
2018-09-20 14:22                 ` Richard Sandiford
2018-09-05 11:51 ` [PATCH 17/25] Fix Fortran STOP ams
     [not found]   ` <c0630914-1252-1391-9bf9-f03434d46f5a@moene.org>
2018-09-05 18:09     ` Janne Blomqvist
2018-09-12 13:56       ` Andrew Stubbs
2018-09-05 11:51 ` [PATCH 18/25] Fix interleaving of Fortran stop messages ams
     [not found]   ` <994a9ec6-2494-9a83-cc84-bd8a551142c5@moene.org>
2018-09-05 18:11     ` Janne Blomqvist
2018-09-12 13:55       ` Andrew Stubbs
2018-09-05 11:51 ` [PATCH 11/25] Simplify vec_merge according to the mask ams
2018-09-17  9:08   ` Richard Sandiford
2018-09-20 15:44     ` Andrew Stubbs
2018-09-26 16:26       ` Andrew Stubbs
2018-09-26 16:50       ` Richard Sandiford
2018-09-26 17:06         ` Andrew Stubbs
2018-09-27  7:28           ` Richard Sandiford
2018-09-27 14:13             ` Andrew Stubbs
2018-09-27 16:28               ` Richard Sandiford
2018-09-27 21:14                 ` Andrew Stubbs
2018-09-28  8:42                   ` Richard Sandiford
2018-09-28 13:50                     ` Andrew Stubbs
2019-02-22  3:40                       ` H.J. Lu
2018-09-05 11:51 ` [PATCH 15/25] Don't double-count early-clobber matches ams
2018-09-17  9:22   ` Richard Sandiford
2018-09-27 22:54     ` Andrew Stubbs
2018-10-04 22:43       ` Richard Sandiford
2018-10-22 15:36         ` Andrew Stubbs
2018-09-05 11:51 ` [PATCH 16/25] Fix IRA ICE ams
2018-09-17  9:36   ` Richard Sandiford
2018-09-18 22:00     ` Andrew Stubbs
2018-09-20 12:47       ` Richard Sandiford
2018-09-20 13:36         ` Andrew Stubbs
2018-09-05 11:52 ` [PATCH 22/25] Add dg-require-effective-target exceptions ams
2018-09-17  9:40   ` Richard Sandiford
2018-09-17 17:53   ` Mike Stump
2018-09-20 16:10     ` Andrew Stubbs
2018-09-05 11:52 ` [PATCH 24/25] Ignore LLVM's blank lines ams
2018-09-14 16:19   ` Jeff Law
2020-03-23 15:29     ` Thomas Schwinge
2020-03-24 21:05       ` Thomas Schwinge
2018-09-05 11:52 ` [PATCH 20/25] GCN libgcc ams
2018-09-05 12:32   ` Joseph Myers
2018-11-09 18:49   ` Jeff Law
2018-11-12 12:01     ` Andrew Stubbs
2018-09-05 11:52 ` [PATCH 19/25] GCN libgfortran ams
     [not found]   ` <41281e27-ad85-e50c-8fed-6f4f6f18289c@moene.org>
2018-09-05 18:14     ` Janne Blomqvist
2018-09-06 12:37       ` Andrew Stubbs
2018-09-11 22:47   ` Jeff Law
2018-09-05 11:52 ` [PATCH 23/25] Testsuite: GCN is always PIE ams
2018-09-14 16:39   ` Jeff Law
2018-09-05 11:53 ` [PATCH 25/25] Port testsuite to GCN ams
2018-09-05 13:40 ` [PATCH 21/25] GCN Back-end (part 1/2) Andrew Stubbs
2018-11-09 19:11   ` Jeff Law
2018-11-12 12:13     ` Andrew Stubbs
2018-09-05 13:43 ` [PATCH 21/25] GCN Back-end (part 2/2) Andrew Stubbs
2018-09-05 14:22   ` Joseph Myers
2018-09-05 14:35     ` Andrew Stubbs
2018-09-05 14:44       ` Joseph Myers
2018-09-11 16:25         ` Andrew Stubbs
2018-09-11 16:41           ` Joseph Myers
2018-09-12 13:42     ` Andrew Stubbs
2018-09-12 15:32       ` Joseph Myers
2018-09-12 16:46         ` Andrew Stubbs
2018-09-12 16:50           ` Joseph Myers
2018-11-09 19:40   ` Jeff Law
2018-11-12 12:53     ` Andrew Stubbs
2018-11-12 17:20       ` Segher Boessenkool
2018-11-12 17:52         ` Andrew Stubbs
2018-11-12 18:33           ` Segher Boessenkool
2018-11-12 18:55           ` Jeff Law
2018-11-13 10:23             ` Andrew Stubbs
2018-11-13 10:33               ` Segher Boessenkool
2018-11-16 16:10             ` Segher Boessenkool
2018-11-17 14:07               ` Segher Boessenkool
2018-11-14 22:31       ` Jeff Law
2018-11-15  9:55         ` Andrew Stubbs
2018-11-16 13:33           ` Andrew Stubbs
