public inbox for gcc-patches@gcc.gnu.org
* [PATCH] i386: Separate costs of RTL expressions from costs of moves
@ 2019-06-17 16:27 H.J. Lu
  2019-06-20  7:40 ` Uros Bizjak
  0 siblings, 1 reply; 14+ messages in thread
From: H.J. Lu @ 2019-06-17 16:27 UTC (permalink / raw)
  To: GCC Patches, Uros Bizjak, skpgkp1, Jan Hubicka, Jeffrey Law,
	Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2542 bytes --]

processor_costs has costs of RTL expressions and costs of moves:

1. Costs of RTL expressions are computed with COSTS_N_INSNS and are used
to generate RTL expressions with the lowest costs.  Costs of RTL memory
operations can be very close to costs of fast instructions, to indicate
fast memory operations.
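
(For reference, rtl.h defines the scale these costs are expressed in:

  /* gcc/rtl.h: convert an instruction count into an RTX cost unit.  */
  #define COSTS_N_INSNS(N) ((N) * 4)

so COSTS_N_INSNS (1) is the cost of one fast instruction, and a memory
operation costed at COSTS_N_INSNS (1) + 1 is rated only slightly more
expensive than a register operation.)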

2. After RTL expressions have been generated, costs of moves are used by
TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
costs for the register allocator.  Costs of loads and stores are higher
than costs of register moves, to reduce stack usage by the register
allocator.
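
A minimal sketch of the idea, assuming the used_by_ra field introduced
below (the real hooks in i386.c handle many more register classes and
modes):

static int
sketch_memory_move_cost (const struct processor_costs *cost, bool in)
{
  /* int_load[2]/int_store[2] are deliberately higher than the
     reg-reg move cost of 2, so spilling looks expensive to the
     register allocator.  */
  return in ? cost->used_by_ra.int_load[2]
	    : cost->used_by_ra.int_store[2];
}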

We should separate costs of RTL expressions from costs of moves so that
they can be adjusted independently.  This patch moves costs of moves into
the new used_by_ra field and duplicates those costs of moves that are
also used as costs of RTL expressions.

All cost models have been checked with

static void
check_one (const struct processor_costs *p)
{
  if (p->used_by_ra.int_load[2] != p->int_load)
    abort ();
  if (p->used_by_ra.int_store[2] != p->int_store)
    abort ();
  if (p->used_by_ra.xmm_move != p->xmm_move)
    abort ();
  if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
    abort ();
  if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
    abort ();
  if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
    abort ();
  if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
    abort ();
}

static void
check_cost ()
{
  check_one (&ix86_size_cost);
  for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
    check_one (processor_cost_table[i]);
}

by calling check_cost from ix86_option_override_internal.
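
The call site itself is not shown here; a development-only placement
inside ix86_option_override_internal could look like this fragment (the
CHECKING_P guard is an assumption, not necessarily what was used):

#if CHECKING_P
  /* Verify that every duplicated move cost stays in sync.  */
  check_cost ();
#endif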

PR target/90878
* config/i386/i386-features.c
(dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
and int_load[2] with int_store and int_load.
* config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
for costs of moves.
(ix86_register_move_cost): Likewise.
(ix86_builtin_vectorization_cost): Replace int_store[2] and
int_load[2] with int_store and int_load.
* config/i386/i386.h (processor_costs): Move costs of moves to
used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
integer_to_sse, sse_load, sse_store, sse_unaligned_load and
sse_unaligned_store for costs of RTL expressions.
* config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
for costs of RTL expressions.  Use sse_unaligned_load and
sse_unaligned_store only for costs of RTL expressions.

-- 
H.J.

[-- Attachment #2: 0001-i386-Separate-costs-of-RTL-expressions-from-costs-of.patch --]
[-- Type: text/x-patch, Size: 54297 bytes --]

From 1c04d184860d613ba0d789b9bc4e8754ca283e1e Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 14 Jun 2019 13:30:16 -0700
Subject: [PATCH] i386: Separate costs of RTL expressions from costs of moves

processor_costs has costs of RTL expressions and costs of moves:

1. Costs of RTL expressions are computed with COSTS_N_INSNS and are used
to generate RTL expressions with the lowest costs.  Costs of RTL memory
operations can be very close to costs of fast instructions, to indicate
fast memory operations.

2. After RTL expressions have been generated, costs of moves are used by
TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
costs for the register allocator.  Costs of loads and stores are higher
than costs of register moves, to reduce stack usage by the register
allocator.

We should separate costs of RTL expressions from costs of moves so that
they can be adjusted independently.  This patch moves costs of moves into
the new used_by_ra field and duplicates those costs of moves that are
also used as costs of RTL expressions.

All cost models have been checked with

static void
check_one (const struct processor_costs *p)
{
  if (p->used_by_ra.int_load[2] != p->int_load)
    abort ();
  if (p->used_by_ra.int_store[2] != p->int_store)
    abort ();
  if (p->used_by_ra.xmm_move != p->xmm_move)
    abort ();
  if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
    abort ();
  if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
    abort ();
  if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
    abort ();
  if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
    abort ();
}

static void
check_cost ()
{
  check_one (&ix86_size_cost);
  for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
    check_one (processor_cost_table[i]);
}

by calling check_cost from ix86_option_override_internal.

	PR target/90878
	* config/i386/i386-features.c
	(dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
	and int_load[2] with int_store and int_load.
	* config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
	for costs of moves.
	(ix86_register_move_cost): Likewise.
	(ix86_builtin_vectorization_cost): Replace int_store[2] and
	int_load[2] with int_store and int_load.
	* config/i386/i386.h (processor_costs): Move costs of moves to
	used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
	integer_to_sse, sse_load, sse_store, sse_unaligned_load and
	sse_unaligned_store for costs of RTL expressions.
	* config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
	xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
	for costs of RTL expressions.  Use sse_unaligned_load and
	sse_unaligned_store only for costs of RTL expressions.
---
 gcc/config/i386/i386-features.c  |   6 +-
 gcc/config/i386/i386.c           |  63 +++--
 gcc/config/i386/i386.h           |  49 ++--
 gcc/config/i386/x86-tune-costs.h | 409 ++++++++++++++++++++++++-------
 4 files changed, 388 insertions(+), 139 deletions(-)

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 2eac8f715bb..34eb70c874f 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -501,9 +501,9 @@ dimode_scalar_chain::compute_convert_gain ()
       if (REG_P (src) && REG_P (dst))
 	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	gain += 2 * ix86_cost->int_store - ix86_cost->sse_store[1];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	gain += 2 * ix86_cost->int_load - ix86_cost->sse_load[1];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
@@ -543,7 +543,7 @@ dimode_scalar_chain::compute_convert_gain ()
 	  if (REG_P (dst))
 	    gain += COSTS_N_INSNS (2);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	    gain += 2 * ix86_cost->int_store - ix86_cost->sse_store[1];
 	  gain -= vector_const_cost (src);
 	}
       else
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 941e208bcf0..bf3184f4a8b 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18511,8 +18511,10 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	    return 100;
 	}
       if (in == 2)
-        return MAX (ix86_cost->fp_load [index], ix86_cost->fp_store [index]);
-      return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
+        return MAX (ix86_cost->used_by_ra.fp_load [index],
+		    ix86_cost->used_by_ra.fp_store [index]);
+      return in ? ix86_cost->used_by_ra.fp_load [index]
+		: ix86_cost->used_by_ra.fp_store [index];
     }
   if (SSE_CLASS_P (regclass))
     {
@@ -18520,8 +18522,10 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
       if (index == -1)
 	return 100;
       if (in == 2)
-        return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
-      return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
+        return MAX (ix86_cost->used_by_ra.sse_load [index],
+		    ix86_cost->used_by_ra.sse_store [index]);
+      return in ? ix86_cost->used_by_ra.sse_load [index]
+		: ix86_cost->used_by_ra.sse_store [index];
     }
   if (MMX_CLASS_P (regclass))
     {
@@ -18538,8 +18542,10 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	    return 100;
 	}
       if (in == 2)
-        return MAX (ix86_cost->mmx_load [index], ix86_cost->mmx_store [index]);
-      return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
+        return MAX (ix86_cost->used_by_ra.mmx_load [index],
+		    ix86_cost->used_by_ra.mmx_store [index]);
+      return in ? ix86_cost->used_by_ra.mmx_load [index]
+		: ix86_cost->used_by_ra.mmx_store [index];
     }
   switch (GET_MODE_SIZE (mode))
     {
@@ -18547,37 +18553,41 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	if (Q_CLASS_P (regclass) || TARGET_64BIT)
 	  {
 	    if (!in)
-	      return ix86_cost->int_store[0];
+	      return ix86_cost->used_by_ra.int_store[0];
 	    if (TARGET_PARTIAL_REG_DEPENDENCY
 	        && optimize_function_for_speed_p (cfun))
-	      cost = ix86_cost->movzbl_load;
+	      cost = ix86_cost->used_by_ra.movzbl_load;
 	    else
-	      cost = ix86_cost->int_load[0];
+	      cost = ix86_cost->used_by_ra.int_load[0];
 	    if (in == 2)
-	      return MAX (cost, ix86_cost->int_store[0]);
+	      return MAX (cost, ix86_cost->used_by_ra.int_store[0]);
 	    return cost;
 	  }
 	else
 	  {
 	   if (in == 2)
-	     return MAX (ix86_cost->movzbl_load, ix86_cost->int_store[0] + 4);
+	     return MAX (ix86_cost->used_by_ra.movzbl_load,
+			 ix86_cost->used_by_ra.int_store[0] + 4);
 	   if (in)
-	     return ix86_cost->movzbl_load;
+	     return ix86_cost->used_by_ra.movzbl_load;
 	   else
-	     return ix86_cost->int_store[0] + 4;
+	     return ix86_cost->used_by_ra.int_store[0] + 4;
 	  }
 	break;
       case 2:
 	if (in == 2)
-	  return MAX (ix86_cost->int_load[1], ix86_cost->int_store[1]);
-	return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
+	  return MAX (ix86_cost->used_by_ra.int_load[1],
+		      ix86_cost->used_by_ra.int_store[1]);
+	return in ? ix86_cost->used_by_ra.int_load[1]
+		  : ix86_cost->used_by_ra.int_store[1];
       default:
 	if (in == 2)
-	  cost = MAX (ix86_cost->int_load[2], ix86_cost->int_store[2]);
+	  cost = MAX (ix86_cost->used_by_ra.int_load[2],
+		      ix86_cost->used_by_ra.int_store[2]);
 	else if (in)
-	  cost = ix86_cost->int_load[2];
+	  cost = ix86_cost->used_by_ra.int_load[2];
 	else
-	  cost = ix86_cost->int_store[2];
+	  cost = ix86_cost->used_by_ra.int_store[2];
 	/* Multiply with the number of GPR moves needed.  */
 	return cost * CEIL ((int) GET_MODE_SIZE (mode), UNITS_PER_WORD);
     }
@@ -18647,20 +18657,21 @@ ix86_register_move_cost (machine_mode mode, reg_class_t class1_i,
        because of missing QImode and HImode moves to, from or between
        MMX/SSE registers.  */
     return MAX (8, SSE_CLASS_P (class1)
-		? ix86_cost->sse_to_integer : ix86_cost->integer_to_sse);
+		? ix86_cost->used_by_ra.sse_to_integer
+		: ix86_cost->used_by_ra.integer_to_sse);
 
   if (MAYBE_FLOAT_CLASS_P (class1))
-    return ix86_cost->fp_move;
+    return ix86_cost->used_by_ra.fp_move;
   if (MAYBE_SSE_CLASS_P (class1))
     {
       if (GET_MODE_BITSIZE (mode) <= 128)
-	return ix86_cost->xmm_move;
+	return ix86_cost->used_by_ra.xmm_move;
       if (GET_MODE_BITSIZE (mode) <= 256)
-	return ix86_cost->ymm_move;
-      return ix86_cost->zmm_move;
+	return ix86_cost->used_by_ra.ymm_move;
+      return ix86_cost->used_by_ra.zmm_move;
     }
   if (MAYBE_MMX_CLASS_P (class1))
-    return ix86_cost->mmx_move;
+    return ix86_cost->used_by_ra.mmx_move;
   return 2;
 }
 
@@ -21071,11 +21082,11 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	/* load/store costs are relative to register move which is 2. Recompute
 	   it to COSTS_N_INSNS so everything has the same base.  */
         return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
-			      : ix86_cost->int_load [2]) / 2;
+			      : ix86_cost->int_load) / 2;
 
       case scalar_store:
         return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
-			      : ix86_cost->int_store [2]) / 2;
+			      : ix86_cost->int_store) / 2;
 
       case vector_stmt:
         return ix86_vec_cost (mode,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 0ac5d651823..1c7ef500d37 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -235,7 +235,11 @@ struct stringop_algs
   } size [MAX_STRINGOP_ALGS];
 };
 
-/* Define the specific costs for a given cpu */
+/* Define the specific costs for a given cpu.  NB: used_by_ra is used
+   by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
+   move costs for the register allocator.  Don't use it to describe the
+   relative costs of RTL expressions in TARGET_RTX_COSTS.
+ */
 
 struct processor_costs {
   const int add;		/* cost of an add instruction */
@@ -252,32 +256,47 @@ struct processor_costs {
   const int large_insn;		/* insns larger than this cost more */
   const int move_ratio;		/* The threshold of number of scalar
 				   memory-to-memory move insns.  */
-  const int movzbl_load;	/* cost of loading using movzbl */
-  const int int_load[3];	/* cost of loading integer registers
+
+  /* Costs used by the register allocator.  integer->integer register move
+     cost is 2.  */
+  struct
+    {
+      const int movzbl_load;	/* cost of loading using movzbl */
+      const int int_load[3];	/* cost of loading integer registers
 				   in QImode, HImode and SImode relative
 				   to reg-reg move (2).  */
-  const int int_store[3];	/* cost of storing integer register
+      const int int_store[3];	/* cost of storing integer register
 				   in QImode, HImode and SImode */
-  const int fp_move;		/* cost of reg,reg fld/fst */
-  const int fp_load[3];		/* cost of loading FP register
+      const int fp_move;	/* cost of reg,reg fld/fst */
+      const int fp_load[3];	/* cost of loading FP register
 				   in SFmode, DFmode and XFmode */
-  const int fp_store[3];	/* cost of storing FP register
+      const int fp_store[3];	/* cost of storing FP register
 				   in SFmode, DFmode and XFmode */
-  const int mmx_move;		/* cost of moving MMX register.  */
-  const int mmx_load[2];	/* cost of loading MMX register
+      const int mmx_move;	/* cost of moving MMX register.  */
+      const int mmx_load[2];	/* cost of loading MMX register
 				   in SImode and DImode */
-  const int mmx_store[2];	/* cost of storing MMX register
+      const int mmx_store[2];	/* cost of storing MMX register
 				   in SImode and DImode */
-  const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
-	    zmm_move;
+      const int xmm_move;	/* cost of moving XMM register.  */
+      const int ymm_move;	/* cost of moving YMM register.  */
+      const int zmm_move;	/* cost of moving ZMM register.  */
+      const int sse_load[5];	/* cost of loading SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+      const int sse_store[5];	/* cost of storing SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+      const int sse_to_integer;	/* cost of moving SSE register to integer.  */
+      const int integer_to_sse;	/* cost of moving integer register to SSE. */
+    } used_by_ra;
+  const int int_load;		/* cost of loading integer register.  */
+  const int int_store;		/* cost of storing integer register.  */
   const int sse_load[5];	/* cost of loading SSE register
 				   in 32bit, 64bit, 128bit, 256bit and 512bit */
-  const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_store[5];	/* cost of storing SSE register
-				   in SImode, DImode and TImode.  */
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_unaligned_store[5];/* cost of unaligned store.  */
+  const int xmm_move;		/* cost of moving XMM register.  */
   const int sse_to_integer;	/* cost of moving SSE register to integer.  */
-  const int integer_to_sse;	/* cost of moving integer register to SSE. */
   const int gather_static, gather_per_elt; /* Cost of gather load is computed
 				   as static + per_item * nelts. */
   const int scatter_static, scatter_per_elt; /* Cost of gather store is
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ac06e37733a..879f5aeb09f 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -56,7 +56,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   0,					/* "large" insn */
   2,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {2, 2, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -75,13 +75,23 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   3, 3, 3,				/* cost of moving XMM,YMM,ZMM register */
   {3, 3, 3, 3, 3},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {3, 3, 3, 3, 3},			/* cost of unaligned SSE load
-					   in 128bit, 256bit and 512bit */
   {3, 3, 3, 3, 3},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {3, 3, 3, 3, 3},				/* cost of unaligned SSE store
-					   in 128bit, 256bit and 512bit */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {3, 3, 3, 3, 3},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of unaligned SSE load
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of unaligned SSE store
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  3,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   5, 0,					/* Gather load static, per_elt.  */
   5, 0,					/* Gather store static, per_elt.  */
   0,					/* size of l1 cache  */
@@ -147,8 +157,7 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   15,					/* "large" insn */
   3,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -167,11 +176,21 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   0,					/* size of l1 cache  */
@@ -236,8 +255,7 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   15,					/* "large" insn */
   3,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -256,11 +274,21 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   4,					/* size of l1 cache.  486 has 8kB cache
@@ -327,8 +355,7 @@ struct processor_costs pentium_cost = {
   8,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -347,11 +374,21 @@ struct processor_costs pentium_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -409,8 +446,7 @@ struct processor_costs lakemont_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -429,11 +465,21 @@ struct processor_costs lakemont_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -506,8 +552,7 @@ struct processor_costs pentiumpro_cost = {
   8,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -526,11 +571,21 @@ struct processor_costs pentiumpro_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -594,8 +649,7 @@ struct processor_costs geode_cost = {
   8,					/* "large" insn */
   4,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {2, 2, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -615,11 +669,21 @@ struct processor_costs geode_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {2, 2, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {2, 2, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
+  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
   2, 2,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -683,8 +747,7 @@ struct processor_costs k6_cost = {
   8,					/* "large" insn */
   4,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   3,				     /* cost for loading QImode using movzbl */
   {4, 5, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -703,11 +766,21 @@ struct processor_costs k6_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {2, 2, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {2, 2, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
+  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
   2, 2,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -777,8 +850,7 @@ struct processor_costs athlon_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -797,11 +869,21 @@ struct processor_costs athlon_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 4, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 4, 12, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -873,8 +955,7 @@ struct processor_costs k8_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -893,11 +974,21 @@ struct processor_costs k8_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 3, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 3, 12, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -973,8 +1064,7 @@ struct processor_costs amdfam10_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -993,11 +1083,11 @@ struct processor_costs amdfam10_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 4, 3, 6, 12},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
   {4, 4, 5, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
   					/* On K8:
   					    MOVD reg64, xmmreg Double FSTORE 4
 					    MOVD reg32, xmmreg Double FSTORE 4
@@ -1006,6 +1096,16 @@ struct processor_costs amdfam10_cost = {
 							       1/1  1/1
 					    MOVD reg32, xmmreg Double FADD 3
 							       1/1  1/1 */
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 4, 3, 6, 12},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 5, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
+  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -1082,8 +1182,7 @@ const struct processor_costs bdver_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {8, 8, 8},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1102,11 +1201,21 @@ const struct processor_costs bdver_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
   {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  8,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {12, 12, 10, 40, 60},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  16,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   16,					/* size of l1 cache.  */
@@ -1187,8 +1296,7 @@ struct processor_costs znver1_cost = {
   8,					/* "large" insn.  */
   9,					/* MOVE_RATIO.  */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
 
   /* reg-reg moves are done by renaming and thus they are even cheaper than
     1 cycle. Because reg-reg move cost is 2 and the following tables correspond
@@ -1214,11 +1322,21 @@ struct processor_costs znver1_cost = {
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
   {6, 6, 6, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves.  */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {6, 6, 6, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
+  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
      throughput 12.  Approx 9 uops do not depend on vector size and every load
      is 7 uops.  */
@@ -1311,8 +1429,7 @@ struct processor_costs znver2_cost = {
   8,					/* "large" insn.  */
   9,					/* MOVE_RATIO.  */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2.  */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
 
   /* reg-reg moves are done by renaming and thus they are even cheaper than
      1 cycle.  Because reg-reg move cost is 2 and following tables correspond
@@ -1339,12 +1456,22 @@ struct processor_costs znver2_cost = {
 					   register.  */
   {6, 6, 6, 10, 20},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE
 					   moves.  */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 8, 16},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
      throughput 12.  Approx 9 uops do not depend on vector size and every load
      is 7 uops.  */
@@ -1438,6 +1565,7 @@ struct processor_costs skylake_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1456,11 +1584,21 @@ struct processor_costs skylake_cost = {
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 10, 20},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 12, 24},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
   2, 2,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 12, 24},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  2,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
   22, 10,				/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -1529,8 +1667,7 @@ const struct processor_costs btver1_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {6, 8, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1549,11 +1686,21 @@ const struct processor_costs btver1_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {10, 10, 12, 48, 96},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1620,8 +1767,7 @@ const struct processor_costs btver2_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {8, 8, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1640,11 +1786,21 @@ const struct processor_costs btver2_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {10, 10, 12, 48, 96},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1710,8 +1866,7 @@ struct processor_costs pentium4_cost = {
   16,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   5,				     /* cost for loading QImode using movzbl */
   {4, 5, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1730,11 +1885,21 @@ struct processor_costs pentium4_cost = {
   12, 24, 48,				/* cost of moving XMM,YMM,ZMM register */
   {16, 16, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
   20, 12,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {16, 16, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
+  {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
+  12,					/* cost of moving XMM register.  */
+  20,					/* cost of moving SSE register to integer.  */
   16, 16,				/* Gather load static, per_elt.  */
   16, 16,				/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -1803,8 +1968,7 @@ struct processor_costs nocona_cost = {
   16,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1823,11 +1987,21 @@ struct processor_costs nocona_cost = {
   6, 12, 24,				/* cost of moving XMM,YMM,ZMM register */
   {12, 12, 12, 24, 48},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
   {12, 12, 12, 24, 48},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
   20, 12,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  4,					/* cost of storing integer register.  */
+  {12, 12, 12, 24, 48},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {12, 12, 12, 24, 48},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
+  {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
+  6,					/* cost of moving XMM register.  */
+  20,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
   12, 12,				/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -1894,8 +2068,7 @@ struct processor_costs atom_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,					/* cost for loading QImode using movzbl */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1914,11 +2087,21 @@ struct processor_costs atom_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {8, 8, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
   8, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {8, 8, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
+  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
   8, 8,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1985,8 +2168,7 @@ struct processor_costs slm_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,					/* cost for loading QImode using movzbl */
   {8, 8, 8},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2005,11 +2187,21 @@ struct processor_costs slm_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {8, 8, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
   8, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  8,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {8, 8, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
+  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
   8, 8,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2076,8 +2268,7 @@ struct processor_costs intel_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2096,11 +2287,21 @@ struct processor_costs intel_cost = {
   2, 2, 2,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 6, 6},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 6},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   4, 4,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 6, 6},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 6},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
+  {10, 10, 10, 10, 10},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  4,					/* cost of moving SSE register to integer.  */
   6, 6,					/* Gather load static, per_elt.  */
   6, 6,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2174,8 +2375,7 @@ struct processor_costs generic_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2194,11 +2394,21 @@ struct processor_costs generic_cost = {
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 10, 15},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 15},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
+  {6, 6, 6, 10, 15},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
   18, 6,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2278,8 +2488,7 @@ struct processor_costs core_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2298,11 +2507,21 @@ struct processor_costs core_cost = {
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 6, 12},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
   2, 2,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 6, 12},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 12},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
+  {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  2,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
      rec. throughput 6.
     So 5 uops statically and one uop per load.  */
-- 
2.20.1




Thread overview: 14+ messages
2019-06-17 16:27 [PATCH] i386: Separate costs of RTL expressions from costs of moves H.J. Lu
2019-06-20  7:40 ` Uros Bizjak
2019-06-20  7:43   ` Uros Bizjak
2019-06-20 15:19     ` H.J. Lu
2019-06-20 20:33       ` Uros Bizjak
2019-06-20 21:10         ` Jan Hubicka
2019-06-20 21:43           ` H.J. Lu
2019-06-23 11:18             ` Jan Hubicka
2019-06-24 13:37           ` Richard Biener
2019-06-24 16:16             ` H.J. Lu
2019-07-23 22:11               ` [PATCH] i386: Separate costs of pseudo registers from hard registers H.J. Lu
2019-08-05 21:21                 ` PING^1 " H.J. Lu
2019-08-09 22:14                 ` Jeff Law
2019-08-10  0:47                   ` H.J. Lu
