public inbox for gcc-patches@gcc.gnu.org
* [Patch 0/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
@ 2015-09-25  8:20 James Greenhalgh
  2015-09-25  8:30 ` [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute James Greenhalgh
  2015-09-25  8:37 ` [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
  0 siblings, 2 replies; 11+ messages in thread
From: James Greenhalgh @ 2015-09-25  8:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: marcus.shawcroft, richard.earnshaw, kyrtka01, ramana.radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 1473 bytes --]

Hi,

This two-patch series updates our scheduling model for the Cortex-A53
processor.

Patch 1/2 splits the "shift_imm" type attribute used by the AArch64
target into two types: rotate_imm and shift_imm.

Patch 2/2 introduces the new scheduling model.

I've bootstrapped and tested the patches, both in series and individually,
on arm-none-linux-gnueabihf and aarch64-none-linux-gnu targets, and
checked performance on a set of popular benchmark suites, which showed
a benefit.

OK for trunk?

Thanks,
James

---
[Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute

2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/arm/types.md (type): Add rotate_imm.
	* config/aarch64/aarch64.md (*ror<mode>3_insn): Split out the
	ROR immediate case.
	(*rorsi3_insn_uxtw): Likewise.
	* config/aarch64/thunderx.md (thunderx_shift): Add rotate_imm.
	* config/arm/cortex-a53.md (cortex_a53_alu_shift): Add rotate_imm.
	* config/arm/cortex-a57.md (cortex_a53_alu): Add rotate_imm.

[Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model

2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/arm/aarch-common-protos.h
	(aarch_accumulator_forwarding): New.
	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
	* config/arm/cortex-a53.md: Rewrite.


* [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute
  2015-09-25  8:20 [Patch 0/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
@ 2015-09-25  8:30 ` James Greenhalgh
  2015-09-25 13:19   ` Kyrill Tkachov
  2015-09-25  8:37 ` [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
  1 sibling, 1 reply; 11+ messages in thread
From: James Greenhalgh @ 2015-09-25  8:30 UTC (permalink / raw)
  To: gcc-patches
  Cc: marcus.shawcroft, richard.earnshaw, kyrtka01, ramana.radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]


Hi,

This patch splits the "shift_imm" type attribute used by AArch64 in
two - giving rotate_imm and shift_imm.

We then apply this transform across the AArch64 pipeline descriptions
which have modelling for shift_imm (cortex-a53, cortex-a57, thunderx).
This should give no functional change to these models.
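
For those less familiar with the pipeline description syntax: after this
change an instruction tagged rotate_imm can be matched separately from
(or together with) shift_imm in a reservation.  A minimal sketch, using
a hypothetical reservation name rather than one taken from the patch:

  ;; Issue rotate-by-immediate and shift-by-immediate instructions to
  ;; either issue slot, with a latency of 2.
  (define_insn_reservation "example_shift" 2
    (and (eq_attr "tune" "cortexa53")
         (eq_attr "type" "rotate_imm,shift_imm"))
    "cortex_a53_slot_any")

The patch below makes the equivalent change to the existing cortex-a53,
cortex-a57 and thunderx descriptions.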

Bootstrapped and tested on aarch64-none-linux-gnu, and
arm-none-linux-gnueabihf with no issues.

OK?

Thanks,
James

---
2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/arm/types.md (type): Add rotate_imm.
	* config/aarch64/aarch64.md (*ror<mode>3_insn): Split out the
	ROR immediate case.
	(*rorsi3_insn_uxtw): Likewise.
	* config/aarch64/thunderx.md (thunderx_shift): Add rotate_imm.
	* config/arm/cortex-a53.md (cortex_a53_alu_shift): Add rotate_imm.
	* config/arm/cortex-a57.md (cortex_a53_alu): Add rotate_imm.


[-- Attachment #2: 0001-Patch-1-2-AArch64-ARM-Give-AArch64-ROR-Immediate-a-n.patch --]
[-- Type: text/x-patch;  name=0001-Patch-1-2-AArch64-ARM-Give-AArch64-ROR-Immediate-a-n.patch, Size: 3652 bytes --]

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 78b9ae2..4f7323c 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3807,13 +3807,15 @@
 
 ;; Rotate right
 (define_insn "*ror<mode>3_insn"
-  [(set (match_operand:GPI 0 "register_operand" "=r")
-        (rotatert:GPI
-          (match_operand:GPI 1 "register_operand" "r")
-          (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "rUs<cmode>")))]
+  [(set (match_operand:GPI 0 "register_operand" "=r,r")
+     (rotatert:GPI
+       (match_operand:GPI 1 "register_operand" "r,r")
+       (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "r,Us<cmode>")))]
   ""
-  "ror\\t%<w>0, %<w>1, %<w>2"
-  [(set_attr "type" "shift_reg")]
+  "@
+   ror\\t%<w>0, %<w>1, %<w>2
+   ror\\t%<w>0, %<w>1, %<w>2"
+  [(set_attr "type" "shift_reg, rotate_imm")]
 )
 
 ;; zero_extend version of above
@@ -3902,7 +3904,7 @@
   operands[3] = GEN_INT (<sizen> - UINTVAL (operands[2]));
   return "ror\\t%<w>0, %<w>1, %3";
 }
-  [(set_attr "type" "shift_imm")]
+  [(set_attr "type" "rotate_imm")]
 )
 
 ;; zero_extend version of the above
@@ -3916,7 +3918,7 @@
   operands[3] = GEN_INT (32 - UINTVAL (operands[2]));
   return "ror\\t%w0, %w1, %3";
 }
-  [(set_attr "type" "shift_imm")]
+  [(set_attr "type" "rotate_imm")]
 )
 
 (define_insn "*<ANY_EXTEND:optab><GPI:mode>_ashl<SHORT:mode>"
diff --git a/gcc/config/aarch64/thunderx.md b/gcc/config/aarch64/thunderx.md
index cf96368..3dae963 100644
--- a/gcc/config/aarch64/thunderx.md
+++ b/gcc/config/aarch64/thunderx.md
@@ -39,7 +39,7 @@
 
 (define_insn_reservation "thunderx_shift" 1
   (and (eq_attr "tune" "thunderx")
-       (eq_attr "type" "bfm,extend,shift_imm,shift_reg,rbit,rev"))
+       (eq_attr "type" "bfm,extend,rotate_imm,shift_imm,shift_reg,rbit,rev"))
   "thunderx_pipe0 | thunderx_pipe1")
 
 
diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
index db572f6..3fa0625 100644
--- a/gcc/config/arm/cortex-a53.md
+++ b/gcc/config/arm/cortex-a53.md
@@ -76,7 +76,7 @@
                         alu_sreg,alus_sreg,logic_reg,logics_reg,\
                         adc_imm,adcs_imm,adc_reg,adcs_reg,\
                         adr,bfm,csel,clz,rbit,rev,alu_dsp_reg,\
-                        shift_imm,shift_reg,\
+                        rotate_imm,shift_imm,shift_reg,\
                         mov_imm,mov_reg,mvn_imm,mvn_reg,\
                         mrs,multiple,no_insn"))
   "cortex_a53_slot_any")
diff --git a/gcc/config/arm/cortex-a57.md b/gcc/config/arm/cortex-a57.md
index a32c848..d6ce440 100644
--- a/gcc/config/arm/cortex-a57.md
+++ b/gcc/config/arm/cortex-a57.md
@@ -296,7 +296,7 @@
 			alu_sreg,alus_sreg,logic_reg,logics_reg,\
 			adc_imm,adcs_imm,adc_reg,adcs_reg,\
 			adr,bfm,clz,rbit,rev,alu_dsp_reg,\
-			shift_imm,shift_reg,\
+			rotate_imm,shift_imm,shift_reg,\
 			mov_imm,mov_reg,\
 			mvn_imm,mvn_reg,\
 			mrs,multiple,no_insn"))
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index ec609ae..534be74 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -120,6 +120,7 @@
 ;                    final output, thus having no impact on scheduling.
 ; rbit               reverse bits.
 ; rev                reverse bytes.
+; rotate_imm         rotate by immediate.
 ; sdiv               signed division.
 ; shift_imm          simple shift operation (LSL, LSR, ASR, ROR) with an
 ;                    immediate.
@@ -627,6 +628,7 @@
   nop,\
   rbit,\
   rev,\
+  rotate_imm,\
   sdiv,\
   shift_imm,\
   shift_reg,\


* [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-09-25  8:20 [Patch 0/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
  2015-09-25  8:30 ` [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute James Greenhalgh
@ 2015-09-25  8:37 ` James Greenhalgh
  2015-10-01  8:31   ` Kyrill Tkachov
  2015-10-01  8:32   ` Marcus Shawcroft
  1 sibling, 2 replies; 11+ messages in thread
From: James Greenhalgh @ 2015-09-25  8:37 UTC (permalink / raw)
  To: gcc-patches
  Cc: marcus.shawcroft, richard.earnshaw, kyrtka01, ramana.radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 642 bytes --]


Hi,

This patch introduces a new scheduling model for Cortex-A53.

Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
and checked with a variety of popular benchmarking and microbenchmarking
suites to show a benefit.

OK?

Thanks,
James

---
2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/arm/aarch-common-protos.h
	(aarch_accumulator_forwarding): New.
	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
	* config/arm/cortex-a53.md: Rewrite.
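
For reviewers less familiar with guarded bypasses: the two new helpers
are used as define_bypass guard functions, so a shortened latency only
applies when the guard returns nonzero for a given producer/consumer
pair.  A sketch of the shape, adapted from the patch below:

  ;; Allow a producer to forward into the accumulator operand of an
  ;; integer multiply-accumulate with a latency of 1, but only when
  ;; aarch_accumulator_forwarding confirms that the dependence really
  ;; is through the accumulator operand.
  (define_bypass 1 "cortex_a53_alu*,cortex_a53_load*,cortex_a53_mul"
                   "cortex_a53_mul"
                   "aarch_accumulator_forwarding")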


[-- Attachment #2: 0002-Patch-2-2-ARM-AArch64-Add-a-new-Cortex-A53-schedulin.patch --]
[-- Type: text/x-patch;  name=0002-Patch-2-2-ARM-AArch64-Add-a-new-Cortex-A53-schedulin.patch, Size: 40418 bytes --]

diff --git a/gcc/config/arm/aarch-common-protos.h b/gcc/config/arm/aarch-common-protos.h
index 29f7c99..348ae74 100644
--- a/gcc/config/arm/aarch-common-protos.h
+++ b/gcc/config/arm/aarch-common-protos.h
@@ -23,7 +23,9 @@
 #ifndef GCC_AARCH_COMMON_PROTOS_H
 #define GCC_AARCH_COMMON_PROTOS_H
 
+extern int aarch_accumulator_forwarding (rtx_insn *, rtx_insn *);
 extern int aarch_crypto_can_dual_issue (rtx_insn *, rtx_insn *);
+extern int aarch_forward_to_shift_is_not_shifted_reg (rtx_insn *, rtx_insn *);
 extern bool aarch_rev16_p (rtx);
 extern bool aarch_rev16_shleft_mask_imm_p (rtx, machine_mode);
 extern bool aarch_rev16_shright_mask_imm_p (rtx, machine_mode);
diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
index 5dd8222..43579d8 100644
--- a/gcc/config/arm/aarch-common.c
+++ b/gcc/config/arm/aarch-common.c
@@ -394,6 +394,112 @@ arm_mac_accumulator_is_result (rtx producer, rtx consumer)
           && !reg_overlap_mentioned_p (result, op1));
 }
 
+/* Return non-zero if the destination of PRODUCER feeds the accumulator
+   operand of an MLA-like operation.  */
+
+int
+aarch_accumulator_forwarding (rtx_insn *producer, rtx_insn *consumer)
+{
+  rtx producer_set = single_set (producer);
+  rtx consumer_set = single_set (consumer);
+
+  /* We are looking for a SET feeding a SET.  */
+  if (!producer_set || !consumer_set)
+    return 0;
+
+  rtx dest = SET_DEST (producer_set);
+  rtx mla = SET_SRC (consumer_set);
+
+  /* We're looking for a register SET.  */
+  if (!REG_P (dest))
+    return 0;
+
+  rtx accumulator;
+
+  /* Strip a zero_extend.  */
+  if (GET_CODE (mla) == ZERO_EXTEND)
+    mla = XEXP (mla, 0);
+
+  switch (GET_CODE (mla))
+    {
+    case PLUS:
+      /* Possibly an MADD.  */
+      if (GET_CODE (XEXP (mla, 0)) == MULT)
+	accumulator = XEXP (mla, 1);
+      else
+	return 0;
+      break;
+    case MINUS:
+      /* Possibly an MSUB.  */
+      if (GET_CODE (XEXP (mla, 1)) == MULT)
+	accumulator = XEXP (mla, 0);
+      else
+	return 0;
+      break;
+    case FMA:
+	{
+	  /* Possibly an FMADD/FMSUB/FNMADD/FNMSUB.  */
+	  if (REG_P (XEXP (mla, 1))
+	      && REG_P (XEXP (mla, 2))
+	      && (REG_P (XEXP (mla, 0))
+		  || GET_CODE (XEXP (mla, 0)) == NEG))
+
+	    {
+	      /* FMADD/FMSUB.  */
+	      accumulator = XEXP (mla, 2);
+	    }
+	  else if (REG_P (XEXP (mla, 1))
+		   && GET_CODE (XEXP (mla, 2)) == NEG
+		   && (REG_P (XEXP (mla, 0))
+		       || GET_CODE (XEXP (mla, 0)) == NEG))
+	    {
+	      /* FNMADD/FNMSUB.  */
+	      accumulator = XEXP (XEXP (mla, 2), 0);
+	    }
+	  else
+	    return 0;
+	  break;
+	}
+      default:
+	/* Not an MLA-like operation.  */
+	return 0;
+    }
+
+  return (REGNO (dest) == REGNO (accumulator));
+}
+
+/* Return nonzero if the CONSUMER instruction is some sort of
+   arithmetic or logic + shift operation, and the register we are
+   writing in PRODUCER is not used in a register shift by register
+   operation.  */
+
+int
+aarch_forward_to_shift_is_not_shifted_reg (rtx_insn *producer,
+					   rtx_insn *consumer)
+{
+  rtx value, op;
+  rtx early_op;
+
+  if (!arm_get_set_operands (producer, consumer, &value, &op))
+    return 0;
+
+  if ((early_op = arm_find_shift_sub_rtx (op)))
+    {
+      if (REG_P (early_op))
+	early_op = op;
+
+      /* Any other canonicalisation of a shift is a shift-by-constant
+	 so we don't care.  */
+      if (GET_CODE (early_op) == ASHIFT)
+	return (!REG_P (XEXP (early_op, 0))
+		|| !REG_P (XEXP (early_op, 1)));
+      else
+	return 1;
+    }
+
+  return 0;
+}
+
 /* Return non-zero if the consumer (a multiply-accumulate instruction)
    has an accumulator dependency on the result of the producer (a
    multiplication instruction) and no other dependency on that result.  */
diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
index 3fa0625..f7dc7e9 100644
--- a/gcc/config/arm/cortex-a53.md
+++ b/gcc/config/arm/cortex-a53.md
@@ -22,345 +22,699 @@
 (define_automaton "cortex_a53")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Functional units.
+;; General-purpose functional units.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; There are two main integer execution pipelines, described as
-;; slot 0 and issue slot 1.
+;; We use slot0 and slot1 to model constraints on which instructions may
+;; dual-issue.
 
 (define_cpu_unit "cortex_a53_slot0" "cortex_a53")
 (define_cpu_unit "cortex_a53_slot1" "cortex_a53")
 
-(define_reservation "cortex_a53_slot_any" "cortex_a53_slot0|cortex_a53_slot1")
-(define_reservation "cortex_a53_single_issue" "cortex_a53_slot0+cortex_a53_slot1")
+(define_reservation "cortex_a53_slot_any"
+		    "cortex_a53_slot0\
+		     |cortex_a53_slot1")
 
-;; The load/store pipeline.  Load/store instructions can dual-issue from
-;; either pipeline, but two load/stores cannot simultaneously issue.
+(define_reservation "cortex_a53_single_issue"
+		    "cortex_a53_slot0\
+		     +cortex_a53_slot1")
 
-(define_cpu_unit "cortex_a53_ls" "cortex_a53")
-
-;; The store pipeline.  Shared between both execution pipelines.
+;; Used to model load and store pipelines.  Load/store instructions
+;; can dual-issue with other instructions, but two load/stores cannot
+;; simultaneously issue.
 
 (define_cpu_unit "cortex_a53_store" "cortex_a53")
+(define_cpu_unit "cortex_a53_load" "cortex_a53")
+(define_cpu_unit "cortex_a53_ls_agen" "cortex_a53")
 
-;; The branch pipeline.  Branches can dual-issue with other instructions
-;; (except when those instructions take multiple cycles to issue).
+;; Used to model a branch pipeline.  Branches can dual-issue with other
+;; instructions (except when those instructions take multiple cycles
+;; to issue).
 
 (define_cpu_unit "cortex_a53_branch" "cortex_a53")
 
-;; The integer divider.
+;; Used to model an integer divide pipeline.
 
 (define_cpu_unit "cortex_a53_idiv" "cortex_a53")
 
-;; The floating-point add pipeline used to model the usage
-;; of the add pipeline by fmac instructions.
-
-(define_cpu_unit "cortex_a53_fpadd_pipe" "cortex_a53")
+;; Used to model an integer multiply/multiply-accumulate pipeline.
 
-;; Floating-point div/sqrt (long latency, out-of-order completion).
+(define_cpu_unit "cortex_a53_imul" "cortex_a53")
 
-(define_cpu_unit "cortex_a53_fp_div_sqrt" "cortex_a53")
+;; Model general structural hazards, for wherever we need them.
 
-;; The Advanced SIMD pipelines.
-
-(define_cpu_unit "cortex_a53_simd0" "cortex_a53")
-(define_cpu_unit "cortex_a53_simd1" "cortex_a53")
+(define_cpu_unit "cortex_a53_hazard" "cortex_a53")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; ALU instructions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_alu" 2
+(define_insn_reservation "cortex_a53_shift" 2
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "alu_imm,alus_imm,logic_imm,logics_imm,\
-                        alu_sreg,alus_sreg,logic_reg,logics_reg,\
-                        adc_imm,adcs_imm,adc_reg,adcs_reg,\
-                        adr,bfm,csel,clz,rbit,rev,alu_dsp_reg,\
-                        rotate_imm,shift_imm,shift_reg,\
-                        mov_imm,mov_reg,mvn_imm,mvn_reg,\
-                        mrs,multiple,no_insn"))
+       (eq_attr "type" "adr,shift_imm,shift_reg,mov_imm,mvn_imm"))
   "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_alu_shift" 2
+(define_insn_reservation "cortex_a53_alu_rotate_imm" 2
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "alu_shift_imm,alus_shift_imm,\
-                        crc,logic_shift_imm,logics_shift_imm,\
-                        alu_ext,alus_ext,alu_shift_reg,alus_shift_reg,\
-                        logic_shift_reg,logics_shift_reg,\
-                        extend,mov_shift,mov_shift_reg,\
-                        mvn_shift,mvn_shift_reg"))
-  "cortex_a53_slot_any")
+       (eq_attr "type" "rotate_imm"))
+  "(cortex_a53_slot1)
+   | (cortex_a53_single_issue)")
 
-;; Forwarding path for unshifted operands.
-
-(define_bypass 1 "cortex_a53_alu,cortex_a53_alu_shift"
-  "cortex_a53_alu")
+(define_insn_reservation "cortex_a53_alu" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_imm,alus_imm,logic_imm,logics_imm,
+			alu_sreg,alus_sreg,logic_reg,logics_reg,
+			adc_imm,adcs_imm,adc_reg,adcs_reg,
+			bfm,csel,clz,rbit,rev,alu_dsp_reg,
+			mov_reg,mvn_reg,
+			mrs,multiple,no_insn"))
+  "cortex_a53_slot_any")
 
-(define_bypass 1 "cortex_a53_alu,cortex_a53_alu_shift"
-  "cortex_a53_alu_shift"
-  "arm_no_early_alu_shift_dep")
+(define_insn_reservation "cortex_a53_alu_shift" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_shift_imm,alus_shift_imm,
+			crc,logic_shift_imm,logics_shift_imm,
+			alu_ext,alus_ext,
+			extend,mov_shift,mvn_shift"))
+  "cortex_a53_slot_any")
 
-;; The multiplier pipeline can forward results so there's no need to specify
-;; bypasses. Multiplies can only single-issue currently.
+(define_insn_reservation "cortex_a53_alu_shift_reg" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_shift_reg,alus_shift_reg,
+			logic_shift_reg,logics_shift_reg,
+			mov_shift_reg,mvn_shift_reg"))
+  "cortex_a53_slot_any+cortex_a53_hazard")
 
 (define_insn_reservation "cortex_a53_mul" 3
   (and (eq_attr "tune" "cortexa53")
        (ior (eq_attr "mul32" "yes")
-            (eq_attr "mul64" "yes")))
-  "cortex_a53_single_issue")
-
-;; A multiply with a single-register result or an MLA, followed by an
-;; MLA with an accumulator dependency, has its result forwarded so two
-;; such instructions can issue back-to-back.
-
-(define_bypass 1 "cortex_a53_mul"
-               "cortex_a53_mul"
-               "arm_mac_accumulator_is_mul_result")
+	    (eq_attr "mul64" "yes")))
+  "cortex_a53_slot_any+cortex_a53_imul")
 
-;; Punt with a high enough latency for divides.
-(define_insn_reservation "cortex_a53_udiv" 8
-  (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "udiv"))
-  "(cortex_a53_slot0+cortex_a53_idiv),cortex_a53_idiv*7")
+;; From the perspective of the GCC scheduling state machine, if we wish to
+;; model an instruction as serialising other instructions, we are best to do
+;; so by modelling it as taking very few cycles.  Scheduling many other
+;; instructions underneath it at the cost of freedom to pick from the
+;; ready list is likely to hurt us more than it helps.  However, we do
+;; want to model some resource and latency cost for divide instructions in
+;; order to avoid divides ending up too lumpy.
 
-(define_insn_reservation "cortex_a53_sdiv" 9
+(define_insn_reservation "cortex_a53_div" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "sdiv"))
-  "(cortex_a53_slot0+cortex_a53_idiv),cortex_a53_idiv*8")
-
-
-(define_bypass 2 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv"
-               "cortex_a53_alu")
-(define_bypass 2 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv"
-               "cortex_a53_alu_shift"
-               "arm_no_early_alu_shift_dep")
+       (eq_attr "type" "udiv,sdiv"))
+  "cortex_a53_slot0,cortex_a53_idiv*2")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Load/store instructions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; Address-generation happens in the issue stage.
+;; TODO: load<n> is not prescriptive about how much data is to be loaded.
+;; This is most obvious for LDRD from AArch32 and LDP (X register) from
+;; AArch64, both are tagged load2 but LDP will load 128-bits compared to
+;; LDRD which is 64-bits.
+;;
+;; For the below, we assume AArch64 X-registers for load2, and AArch32
+;; registers for load3/load4.
 
-(define_insn_reservation "cortex_a53_load1" 3
+(define_insn_reservation "cortex_a53_load1" 4
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load_byte,load1,load_acq"))
-  "cortex_a53_slot_any+cortex_a53_ls")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_load")
 
 (define_insn_reservation "cortex_a53_store1" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store1,store_rel"))
-  "cortex_a53_slot_any+cortex_a53_ls+cortex_a53_store")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
 
-(define_insn_reservation "cortex_a53_load2" 3
+;; Model AArch64-sized LDP Xm, Xn, [Xa]
+
+(define_insn_reservation "cortex_a53_load2" 4
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load2"))
-  "cortex_a53_single_issue+cortex_a53_ls")
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
 (define_insn_reservation "cortex_a53_store2" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store2"))
-  "cortex_a53_single_issue+cortex_a53_ls+cortex_a53_store")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
+
+;; Model AArch32-sized LDM Ra, {Rm, Rn, Ro}
 
-(define_insn_reservation "cortex_a53_load3plus" 4
+(define_insn_reservation "cortex_a53_load3plus" 6
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load3,load4"))
-  "(cortex_a53_single_issue+cortex_a53_ls)*2")
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_store3plus" 3
+(define_insn_reservation "cortex_a53_store3plus" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store3,store4"))
-  "(cortex_a53_single_issue+cortex_a53_ls+cortex_a53_store)*2")
-
-;; Load/store addresses are required early in Issue.
-(define_bypass 3 "cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus,cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_load*"
-                 "arm_early_load_addr_dep")
-(define_bypass 3 "cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus,cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_store*"
-                 "arm_early_store_addr_dep")
-
-;; Load data can forward in the ALU pipeline
-(define_bypass 2 "cortex_a53_load1,cortex_a53_load2"
-               "cortex_a53_alu")
-(define_bypass 2 "cortex_a53_load1,cortex_a53_load2"
-               "cortex_a53_alu_shift"
-               "arm_no_early_alu_shift_dep")
-
-;; ALU ops can forward to stores.
-(define_bypass 0 "cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_store1,cortex_a53_store2,cortex_a53_store3plus"
-                 "arm_no_early_store_addr_dep")
-
-(define_bypass 1 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv,cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus"
-                 "cortex_a53_store1,cortex_a53_store2,cortex_a53_store3plus"
-                 "arm_no_early_store_addr_dep")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store+cortex_a53_slot0,
+   cortex_a53_store")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Branches.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; Currently models all branches as dual-issuable from either execution
-;; slot, which isn't true for all cases. We still need to model indirect
-;; branches.
+;; Model all branches as dual-issuable from either execution, which
+;; is not strictly true for all cases (indirect branches).
 
 (define_insn_reservation "cortex_a53_branch" 0
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "branch,call"))
-  "cortex_a53_slot_any+cortex_a53_branch")
+  "cortex_a53_slot_any,cortex_a53_branch")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; General-purpose register bypasses
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;; Model bypasses for unshifted operands to ALU instructions.
+
+(define_bypass 1 "cortex_a53_shift"
+		 "cortex_a53_shift")
+
+(define_bypass 1 "cortex_a53_alu,
+		  cortex_a53_alu_shift*,
+		  cortex_a53_alu_rotate_imm,
+		  cortex_a53_shift"
+		 "cortex_a53_alu")
+
+(define_bypass 2 "cortex_a53_alu,
+		  cortex_a53_alu_shift*"
+		 "cortex_a53_alu_shift*"
+		 "aarch_forward_to_shift_is_not_shifted_reg")
+
+;; In our model, we allow any general-purpose register operation to
+;; bypass to the accumulator operand of an integer MADD-like operation.
+
+(define_bypass 1 "cortex_a53_alu*,
+		  cortex_a53_load*,
+		  cortex_a53_mul"
+		 "cortex_a53_mul"
+		 "aarch_accumulator_forwarding")
+
+;; Model a bypass from MLA/MUL to many ALU instructions.
+
+(define_bypass 2 "cortex_a53_mul"
+		 "cortex_a53_alu,
+		  cortex_a53_alu_shift*")
+
+;; We get neater schedules by allowing an MLA/MUL to feed an
+;; early load address dependency to a load.
+
+(define_bypass 2 "cortex_a53_mul"
+		 "cortex_a53_load*"
+		 "arm_early_load_addr_dep")
+
+;; Model bypasses for loads which are to be consumed by the ALU.
+
+(define_bypass 2 "cortex_a53_load1"
+		 "cortex_a53_alu")
+
+(define_bypass 3 "cortex_a53_load1"
+		 "cortex_a53_alu_shift*")
+
+;; Model a bypass for ALU instructions feeding stores.
+
+(define_bypass 1 "cortex_a53_alu*"
+		 "cortex_a53_store1,
+		  cortex_a53_store2,
+		  cortex_a53_store3plus"
+		 "arm_no_early_store_addr_dep")
+
+;; Model a bypass for load and multiply instructions feeding stores.
+
+(define_bypass 2 "cortex_a53_mul,
+		  cortex_a53_load1,
+		  cortex_a53_load2,
+		  cortex_a53_load3plus"
+		 "cortex_a53_store1,
+		  cortex_a53_store2,
+		  cortex_a53_store3plus"
+		 "arm_no_early_store_addr_dep")
+
+;; Model a GP->FP register move as similar to stores.
+
+(define_bypass 1 "cortex_a53_alu*"
+		 "cortex_a53_r2f")
+
+(define_bypass 2 "cortex_a53_mul,
+		  cortex_a53_load1,
+		  cortex_a53_load2,
+		  cortex_a53_load3plus"
+		 "cortex_a53_r2f")
+
+;; Shifts feeding Load/Store addresses may not be ready in time.
+
+(define_bypass 3 "cortex_a53_shift"
+		 "cortex_a53_load*"
+		 "arm_early_load_addr_dep")
+
+(define_bypass 3 "cortex_a53_shift"
+		 "cortex_a53_store*"
+		 "arm_early_store_addr_dep")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define_automaton "cortex_a53_advsimd")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Broad Advanced SIMD type categorisation
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define_attr "cortex_a53_advsimd_type"
+  "advsimd_alu, advsimd_alu_q,
+   advsimd_mul, advsimd_mul_q,
+   advsimd_div_s, advsimd_div_s_q,
+   advsimd_div_d, advsimd_div_d_q,
+   advsimd_load_64, advsimd_store_64,
+   advsimd_load_128, advsimd_store_128,
+   advsimd_load_lots, advsimd_store_lots,
+   unknown"
+  (cond [
+    (eq_attr "type" "neon_add, neon_qadd, neon_add_halve, neon_sub, neon_qsub,\
+		     neon_sub_halve, neon_abs, neon_neg, neon_qneg,\
+		     neon_qabs, neon_abd, neon_minmax, neon_compare,\
+		     neon_compare_zero, neon_arith_acc, neon_reduc_add,\
+		     neon_reduc_add_acc, neon_reduc_minmax,\
+		     neon_logic, neon_tst, neon_shift_imm,\
+		     neon_shift_reg, neon_shift_acc, neon_sat_shift_imm,\
+		     neon_sat_shift_reg, neon_ins, neon_move,\
+		     neon_permute, neon_zip, neon_tbl1,\
+		     neon_tbl2, neon_tbl3, neon_tbl4, neon_bsl,\
+		     neon_cls, neon_cnt, neon_dup,\
+		     neon_ext, neon_rbit, neon_rev,\
+		     neon_fp_abd_s, neon_fp_abd_d,\
+		     neon_fp_abs_s, neon_fp_abs_d,\
+		     neon_fp_addsub_s, neon_fp_addsub_d, neon_fp_compare_s,\
+		     neon_fp_compare_d, neon_fp_minmax_s,\
+		     neon_fp_minmax_d, neon_fp_neg_s, neon_fp_neg_d,\
+		     neon_fp_reduc_add_s, neon_fp_reduc_add_d,\
+		     neon_fp_reduc_minmax_s, neon_fp_reduc_minmax_d,\
+		     neon_fp_cvt_widen_h, neon_fp_to_int_s,neon_fp_to_int_d,\
+		     neon_int_to_fp_s, neon_int_to_fp_d, neon_fp_round_s,\
+		     neon_fp_recpe_s, neon_fp_recpe_d, neon_fp_recps_s,\
+		     neon_fp_recps_d, neon_fp_recpx_s, neon_fp_recpx_d,\
+		     neon_fp_rsqrte_s, neon_fp_rsqrte_d, neon_fp_rsqrts_s,\
+		     neon_fp_rsqrts_d")
+      (const_string "advsimd_alu")
+    (eq_attr "type" "neon_add_q, neon_add_widen, neon_add_long,\
+		     neon_qadd_q, neon_add_halve_q, neon_add_halve_narrow_q,\
+		     neon_sub_q, neon_sub_widen, neon_sub_long,\
+		     neon_qsub_q, neon_sub_halve_q, neon_sub_halve_narrow_q,\
+		     neon_abs_q, neon_neg_q, neon_qneg_q, neon_qabs_q,\
+		     neon_abd_q, neon_abd_long, neon_minmax_q,\
+		     neon_compare_q, neon_compare_zero_q,\
+		     neon_arith_acc_q, neon_reduc_add_q,\
+		     neon_reduc_add_long, neon_reduc_add_acc_q,\
+		     neon_reduc_minmax_q, neon_logic_q, neon_tst_q,\
+		     neon_shift_imm_q, neon_shift_imm_narrow_q,\
+		     neon_shift_imm_long, neon_shift_reg_q,\
+		     neon_shift_acc_q, neon_sat_shift_imm_q,\
+		     neon_sat_shift_imm_narrow_q, neon_sat_shift_reg_q,\
+		     neon_ins_q, neon_move_q, neon_move_narrow_q,\
+		     neon_permute_q, neon_zip_q,\
+		     neon_tbl1_q, neon_tbl2_q, neon_tbl3_q,\
+		     neon_tbl4_q, neon_bsl_q, neon_cls_q, neon_cnt_q,\
+		     neon_dup_q, neon_ext_q, neon_rbit_q,\
+		     neon_rev_q, neon_fp_abd_s_q, neon_fp_abd_d_q,\
+		     neon_fp_abs_s_q, neon_fp_abs_d_q,\
+		     neon_fp_addsub_s_q, neon_fp_addsub_d_q,\
+		     neon_fp_compare_s_q, neon_fp_compare_d_q,\
+		     neon_fp_minmax_s_q, neon_fp_minmax_d_q,\
+		     neon_fp_cvt_widen_s, neon_fp_neg_s_q, neon_fp_neg_d_q,\
+		     neon_fp_reduc_add_s_q, neon_fp_reduc_add_d_q,\
+		     neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d_q,\
+		     neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
+		     neon_fp_to_int_s_q, neon_fp_to_int_d_q,\
+		     neon_int_to_fp_s_q, neon_int_to_fp_d_q,\
+		     neon_fp_round_s_q,\
+		     neon_fp_recpe_s_q, neon_fp_recpe_d_q,\
+		     neon_fp_recps_s_q, neon_fp_recps_d_q,\
+		     neon_fp_recpx_s_q, neon_fp_recpx_d_q,\
+		     neon_fp_rsqrte_s_q, neon_fp_rsqrte_d_q,\
+		     neon_fp_rsqrts_s_q, neon_fp_rsqrts_d_q")
+      (const_string "advsimd_alu_q")
+    (eq_attr "type" "neon_mul_b, neon_mul_h, neon_mul_s,\
+		     neon_mul_h_scalar, neon_mul_s_scalar,\
+		     neon_sat_mul_b, neon_sat_mul_h, neon_sat_mul_s,\
+		     neon_sat_mul_h_scalar, neon_sat_mul_s_scalar,\
+		     neon_mla_b, neon_mla_h, neon_mla_s,\
+		     neon_mla_h_scalar, neon_mla_s_scalar,\
+		     neon_fp_mul_s, neon_fp_mul_s_scalar,\
+		     neon_fp_mul_d, neon_fp_mla_s,\
+		     neon_fp_mla_s_scalar, neon_fp_mla_d")
+      (const_string "advsimd_mul")
+    (eq_attr "type" "neon_mul_b_q, neon_mul_h_q, neon_mul_s_q,\
+		     neon_mul_b_long, neon_mul_h_long, neon_mul_s_long,\
+		     neon_mul_d_long, neon_mul_h_scalar_q,\
+		     neon_mul_s_scalar_q, neon_mul_h_scalar_long,\
+		     neon_mul_s_scalar_long, neon_sat_mul_b_q,\
+		     neon_sat_mul_h_q, neon_sat_mul_s_q,\
+		     neon_sat_mul_b_long, neon_sat_mul_h_long,\
+		     neon_sat_mul_s_long, neon_sat_mul_h_scalar_q,\
+		     neon_sat_mul_s_scalar_q, neon_sat_mul_h_scalar_long,\
+		     neon_sat_mul_s_scalar_long, neon_mla_b_q,\
+		     neon_mla_h_q, neon_mla_s_q, neon_mla_b_long,\
+		     neon_mla_h_long, neon_mla_s_long,\
+		     neon_mla_h_scalar_q, neon_mla_s_scalar_q,\
+		     neon_mla_h_scalar_long, neon_mla_s_scalar_long,\
+		     neon_sat_mla_b_long, neon_sat_mla_h_long,\
+		     neon_sat_mla_s_long, neon_sat_mla_h_scalar_long,\
+		     neon_sat_mla_s_scalar_long,\
+		     neon_fp_mul_s_q, neon_fp_mul_s_scalar_q,\
+		     neon_fp_mul_d_q, neon_fp_mul_d_scalar_q,\
+		     neon_fp_mla_s_q, neon_fp_mla_s_scalar_q,\
+		     neon_fp_mla_d_q, neon_fp_mla_d_scalar_q")
+      (const_string "advsimd_mul_q")
+    (eq_attr "type" "neon_fp_sqrt_s, neon_fp_div_s")
+      (const_string "advsimd_div_s")
+    (eq_attr "type" "neon_fp_sqrt_s_q, neon_fp_div_s_q")
+      (const_string "advsimd_div_s_q")
+    (eq_attr "type" "neon_fp_sqrt_d, neon_fp_div_d")
+      (const_string "advsimd_div_d")
+    (eq_attr "type" "neon_fp_sqrt_d_q, neon_fp_div_d_q")
+      (const_string "advsimd_div_d_q")
+    (eq_attr "type" "neon_ldr, neon_load1_1reg,\
+		     neon_load1_all_lanes, neon_load1_all_lanes_q,\
+		     neon_load1_one_lane, neon_load1_one_lane_q")
+      (const_string "advsimd_load_64")
+    (eq_attr "type" "neon_str, neon_store1_1reg,\
+		     neon_store1_one_lane,neon_store1_one_lane_q")
+      (const_string "advsimd_store_64")
+    (eq_attr "type" "neon_load1_1reg_q, neon_load1_2reg,\
+		     neon_load2_2reg,\
+		     neon_load2_all_lanes, neon_load2_all_lanes_q,\
+		     neon_load2_one_lane, neon_load2_one_lane_q")
+      (const_string "advsimd_load_128")
+    (eq_attr "type" "neon_store1_1reg_q, neon_store1_2reg,\
+		     neon_store2_2reg,\
+		     neon_store2_one_lane, neon_store2_one_lane_q")
+      (const_string "advsimd_store_128")
+    (eq_attr "type" "neon_load1_2reg_q, neon_load1_3reg, neon_load1_3reg_q,\
+		     neon_load1_4reg, neon_load1_4reg_q, \
+		     neon_load2_2reg_q, neon_load2_4reg,\
+		     neon_load2_4reg_q, neon_load3_3reg,\
+		     neon_load3_3reg_q, neon_load3_all_lanes,\
+		     neon_load3_all_lanes_q, neon_load3_one_lane,\
+		     neon_load3_one_lane_q, neon_load4_4reg,\
+		     neon_load4_4reg_q, neon_load4_all_lanes,\
+		     neon_load4_all_lanes_q, neon_load4_one_lane,\
+		     neon_load4_one_lane_q")
+      (const_string "advsimd_load_lots")
+    (eq_attr "type" "neon_store1_2reg_q, neon_store1_3reg,\
+		     neon_store1_3reg_q, neon_store1_4reg,\
+		     neon_store1_4reg_q, neon_store2_2reg_q,\
+		     neon_store2_4reg, neon_store2_4reg_q,\
+		     neon_store3_3reg, neon_store3_3reg_q,\
+		     neon_store3_one_lane, neon_store3_one_lane_q,\
+		     neon_store4_4reg, neon_store4_4reg_q,\
+		     neon_store4_one_lane, neon_store4_one_lane_q")
+      (const_string "advsimd_store_lots")]
+      (const_string "unknown")))
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD functional units.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;; We model the Advanced SIMD unit as two 64-bit units, each with three
+;; pipes, FP_ALU, FP_MUL, FP_DIV.  We also give convenient reservations
+;; for 128-bit Advanced SIMD instructions, which use both units.
+
+;; The floating-point/Advanced SIMD ALU pipelines.
+
+(define_cpu_unit "cortex_a53_fp_alu_lo,\
+		  cortex_a53_fp_alu_hi"
+		 "cortex_a53_advsimd")
+
+(define_reservation "cortex_a53_fp_alu"
+		    "cortex_a53_fp_alu_lo\
+		     |cortex_a53_fp_alu_hi")
+
+(define_reservation "cortex_a53_fp_alu_q"
+		    "cortex_a53_fp_alu_lo\
+		     +cortex_a53_fp_alu_hi")
+
+;; The floating-point/Advanced SIMD multiply/multiply-accumulate
+;; pipelines.
+
+(define_cpu_unit "cortex_a53_fp_mul_lo,\
+		  cortex_a53_fp_mul_hi"
+		 "cortex_a53_advsimd")
+
+(define_reservation "cortex_a53_fp_mul"
+		    "cortex_a53_fp_mul_lo\
+		     |cortex_a53_fp_mul_hi")
+
+(define_reservation "cortex_a53_fp_mul_q"
+		    "cortex_a53_fp_mul_lo\
+		     +cortex_a53_fp_mul_hi")
+
+;; Floating-point/Advanced SIMD divide/square root.
+
+(define_cpu_unit "cortex_a53_fp_div_lo,\
+		  cortex_a53_fp_div_hi"
+		 "cortex_a53_advsimd")
+
+;; Once we choose a pipe, stick with it for three simulated cycles.
+
+(define_reservation "cortex_a53_fp_div"
+		    "(cortex_a53_fp_div_lo*3)\
+		     |(cortex_a53_fp_div_hi*3)")
+
+(define_reservation "cortex_a53_fp_div_q"
+		    "(cortex_a53_fp_div_lo*3)\
+		     +(cortex_a53_fp_div_hi*3)")
+
+;; Cryptographic extensions
+
+(define_cpu_unit "cortex_a53_crypto"
+		 "cortex_a53_advsimd")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Floating-point arithmetic.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_fpalu" 4
+(define_insn_reservation "cortex_a53_fpalu" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "ffariths, fadds, ffarithd, faddd, fmov, fmuls,\
-                        f_cvt,f_cvtf2i,f_cvti2f,\
-                        fcmps, fcmpd, fcsel, f_rints, f_rintd, f_minmaxs,\
-                        f_minmaxd"))
-  "cortex_a53_slot0+cortex_a53_fpadd_pipe")
+	(eq_attr "type" "ffariths, fadds, ffarithd, faddd, fmov,
+			f_cvt, fcmps, fcmpd, fcsel, f_rints, f_rintd,
+			f_minmaxs, f_minmaxd"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fconst" 2
+(define_insn_reservation "cortex_a53_fconst" 3
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fconsts,fconstd"))
-  "cortex_a53_slot0+cortex_a53_fpadd_pipe")
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fpmul" 4
+(define_insn_reservation "cortex_a53_fpmul" 5
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fmuls,fmuld"))
-  "cortex_a53_slot0")
+  "cortex_a53_slot_any,cortex_a53_fp_mul")
 
-;; For single-precision multiply-accumulate, the add (accumulate) is issued after
-;; the multiply completes. Model that accordingly.
+;; For multiply-accumulate, model the add (accumulate) as being issued
+;; after the multiply completes.
 
 (define_insn_reservation "cortex_a53_fpmac" 8
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fmacs,fmacd,ffmas,ffmad"))
-  "cortex_a53_slot0, nothing*3, cortex_a53_fpadd_pipe")
+  "cortex_a53_slot_any,cortex_a53_fp_mul,
+   nothing*3, cortex_a53_fp_alu")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Floating-point divide/square root instructions.
+;; Floating-point to/from core transfers.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; fsqrt really takes one cycle less, but that is not modelled.
 
-(define_insn_reservation "cortex_a53_fdivs" 14
+(define_insn_reservation "cortex_a53_r2f" 6
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "fdivs, fsqrts"))
-  "cortex_a53_slot0, cortex_a53_fp_div_sqrt * 5")
+       (eq_attr "type" "f_mcr,f_mcrr,f_cvti2f,
+			neon_from_gp, neon_from_gp_q"))
+  "cortex_a53_slot_any,cortex_a53_store,
+   nothing,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fdivd" 29
+(define_insn_reservation "cortex_a53_f2r" 6
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "fdivd, fsqrtd"))
-  "cortex_a53_slot0, cortex_a53_fp_div_sqrt * 8")
+       (eq_attr "type" "f_mrc,f_mrrc,f_cvtf2i,
+			neon_to_gp, neon_to_gp_q"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu,
+   nothing,cortex_a53_store")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; ARMv8-A Cryptographic extensions.
+;; Floating-point flag transfer.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_crypto_aese" 2
+(define_insn_reservation "cortex_a53_f_flags" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_aese"))
-  "cortex_a53_simd0")
+       (eq_attr "type" "f_flag"))
+  "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_crypto_aesmc" 2
-  (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_aesmc"))
-  "cortex_a53_simd0 | cortex_a53_simd1")
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point load/store.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_crypto_sha1_fast" 2
+(define_insn_reservation "cortex_a53_f_load_64" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_fast, crypto_sha256_fast"))
-  "cortex_a53_simd0")
+       (ior (eq_attr "type" "f_loads,f_loadd")
+	    (eq_attr "cortex_a53_advsimd_type"
+		     "advsimd_load_64")))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_crypto_sha1_xor" 3
+(define_insn_reservation "cortex_a53_f_load_many" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_xor"))
-  "cortex_a53_simd0")
+       (eq_attr "cortex_a53_advsimd_type"
+		"advsimd_load_128,advsimd_load_lots"))
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_crypto_sha_slow" 5
+(define_insn_reservation "cortex_a53_f_store_64" 0
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_slow, crypto_sha256_slow"))
-  "cortex_a53_simd0")
+       (ior (eq_attr "type" "f_stores,f_stored")
+	    (eq_attr "cortex_a53_advsimd_type"
+		     "advsimd_store_64")))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
 
-(define_bypass 0 "cortex_a53_crypto_aese"
-                 "cortex_a53_crypto_aesmc"
-                 "aarch_crypto_can_dual_issue")
+(define_insn_reservation "cortex_a53_f_store_many" 0
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type"
+		"advsimd_store_128,advsimd_store_lots"))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store+cortex_a53_slot0,
+   cortex_a53_store")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP to/from core transfers.
+;; Advanced SIMD.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_r2f" 4
+;; Either we want to model use of the ALU pipe, the multiply pipe or the
+;; divide/sqrt pipe.  In all cases we need to check if we are a 64-bit
+;; operation (in which case we model dual-issue without penalty)
+;; or a 128-bit operation in which case we require in our model that we
+;; issue from slot 0.
+
+(define_insn_reservation "cortex_a53_advsimd_alu" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_mcr,f_mcrr"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_alu"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_f2r" 2
+(define_insn_reservation "cortex_a53_advsimd_alu_q" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_mrc,f_mrrc"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_alu_q"))
+  "cortex_a53_slot0,cortex_a53_fp_alu_q")
 
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP flag transfer.
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+(define_insn_reservation "cortex_a53_advsimd_mul" 5
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_mul"))
+  "cortex_a53_slot_any,cortex_a53_fp_mul")
 
-(define_insn_reservation "cortex_a53_f_flags" 4
+(define_insn_reservation "cortex_a53_advsimd_mul_q" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_flag"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_mul_q"))
+  "cortex_a53_slot0,cortex_a53_fp_mul_q")
+
+;; SIMD Dividers.
+
+(define_insn_reservation "cortex_a53_advsimd_div_s" 14
+  (and (eq_attr "tune" "cortexa53")
+       (ior (eq_attr "type" "fdivs,fsqrts")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_s")))
+  "cortex_a53_slot0,cortex_a53_fp_mul,
+   cortex_a53_fp_div")
+
+(define_insn_reservation "cortex_a53_advsimd_div_d" 29
+  (and (eq_attr "tune" "cortexa53")
+       (ior (eq_attr "type" "fdivd,fsqrtd")
+	    (eq_attr "cortex_a53_advsimd_type" "advsimd_div_d")))
+  "cortex_a53_slot0,cortex_a53_fp_mul,
+   cortex_a53_fp_div")
+
+(define_insn_reservation "cortex_a53_advsimd_div_s_q" 14
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_s_q"))
+  "cortex_a53_single_issue,cortex_a53_fp_mul_q,
+   cortex_a53_fp_div_q")
+
+(define_insn_reservation "cortex_a53_advsimd_divd_q" 29
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_d_q"))
+  "cortex_a53_single_issue,cortex_a53_fp_mul_q,
+   cortex_a53_fp_div_q")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP load/store.
+;; ARMv8-A Cryptographic extensions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_f_loads" 4
+;; We want AESE and AESMC to end up consecutive to one another.
+
+(define_insn_reservation "cortex_a53_crypto_aese" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_loads"))
+       (eq_attr "type" "crypto_aese"))
   "cortex_a53_slot0")
 
-(define_insn_reservation "cortex_a53_f_loadd" 5
+(define_insn_reservation "cortex_a53_crypto_aesmc" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_loadd"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_aesmc"))
+  "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_f_load_2reg" 5
+;; SHA1H
+
+(define_insn_reservation "cortex_a53_crypto_sha1_fast" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "neon_load2_2reg_q"))
-  "(cortex_a53_slot_any+cortex_a53_ls)*2")
+       (eq_attr "type" "crypto_sha1_fast"))
+  "cortex_a53_slot_any,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_loadq" 5
+(define_insn_reservation "cortex_a53_crypto_sha256_fast" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "neon_load1_1reg_q"))
-  "cortex_a53_slot_any+cortex_a53_ls")
+       (eq_attr "type" "crypto_sha256_fast"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_stores" 0
+(define_insn_reservation "cortex_a53_crypto_sha1_xor" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_stores"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_sha1_xor"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_stored" 0
+(define_insn_reservation "cortex_a53_crypto_sha_slow" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_stored"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_sha1_slow, crypto_sha256_slow"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-;; Load-to-use for floating-point values has a penalty of one cycle,
-;; i.e. a latency of two.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD register bypasses.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_bypass 2 "cortex_a53_f_loads"
-                 "cortex_a53_fpalu, cortex_a53_fpmac, cortex_a53_fpmul,\
-		  cortex_a53_fdivs, cortex_a53_fdivd,\
-		  cortex_a53_f2r")
+;; Model the late use of the accumulator operand for floating-point
+;; multiply-accumulate operations as a bypass reducing the latency
+;; of producing instructions to near zero.
 
-(define_bypass 2 "cortex_a53_f_loadd"
-                 "cortex_a53_fpalu, cortex_a53_fpmac, cortex_a53_fpmul,\
-		  cortex_a53_fdivs, cortex_a53_fdivd,\
-		  cortex_a53_f2r")
+(define_bypass 1 "cortex_a53_fp*,
+		  cortex_a53_r2f,
+		  cortex_a53_f_load*"
+		 "cortex_a53_fpmac"
+		 "aarch_accumulator_forwarding")
 
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Crude Advanced SIMD approximation.
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Model a bypass from the result of an FP operation to a use.
+
+(define_bypass 4 "cortex_a53_fpalu,
+		  cortex_a53_fpmul"
+		 "cortex_a53_fpalu,
+		  cortex_a53_fpmul,
+		  cortex_a53_fpmac,
+		  cortex_a53_advsimd_div*")
+
+;; We want AESE and AESMC to end up consecutive to one another.
+
+(define_bypass 0 "cortex_a53_crypto_aese"
+		 "cortex_a53_crypto_aesmc"
+		 "aarch_crypto_can_dual_issue")
 
-(define_insn_reservation "cortex_a53_advsimd" 4
-  (and (eq_attr "tune" "cortexa53")
-       (eq_attr "is_neon_type" "yes"))
-  "cortex_a53_simd0")


* Re: [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute
  2015-09-25  8:30 ` [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute James Greenhalgh
@ 2015-09-25 13:19   ` Kyrill Tkachov
  2015-09-25 13:49     ` James Greenhalgh
  0 siblings, 1 reply; 11+ messages in thread
From: Kyrill Tkachov @ 2015-09-25 13:19 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: Marcus Shawcroft, Richard Earnshaw, Ramana Radhakrishnan

Hi James,

On 25/09/15 08:59, James Greenhalgh wrote:
> Hi,
>
> This patch splits the "shift_imm" type attribute used by AArch64 in
> two - giving rotate_imm and shift_imm.
>
> We then apply this transform across the AArch64 pipeline descriptions
> which have modelling for shift_imm (cortex-a53, cortex-a57, thunderx).
> This should give no functional change to these models.
>
> Bootstrapped and tested on aarch64-none-linux-gnu, and
> arm-none-linux-gnueabihf with no issues.
>
> OK?
>
> Thanks,
> James
>
> ---
> 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
>
> 	* config/arm/types.md (type): Add rotate_imm.
> 	* config/aarch64/aarch64.md (*ror<mode>3_insn): Split out the
> 	ROR immediate case.
> 	(*rorsi3_insn_uxtw): Likewise.
> 	* config/aarch64/thunderx.md (thunderx_shift): Add rotate_imm.
> 	* config/arm/cortex-a53.md (cortex_a53_alu_shift): Add rotate_imm.
> 	* config/arm/cortex-a57.md (cortex_a53_alu): Add rotate_imm.
>

--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3807,13 +3807,15 @@
  
  ;; Rotate right
  (define_insn "*ror<mode>3_insn"
-  [(set (match_operand:GPI 0 "register_operand" "=r")
-        (rotatert:GPI
-          (match_operand:GPI 1 "register_operand" "r")
-          (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "rUs<cmode>")))]
+  [(set (match_operand:GPI 0 "register_operand" "=r,r")
+     (rotatert:GPI
+       (match_operand:GPI 1 "register_operand" "r,r")
+       (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "r,Us<cmode>")))]
    ""
-  "ror\\t%<w>0, %<w>1, %<w>2"
-  [(set_attr "type" "shift_reg")]
+  "@
+   ror\\t%<w>0, %<w>1, %<w>2
+   ror\\t%<w>0, %<w>1, %<w>2"
+  [(set_attr "type" "shift_reg, rotate_imm")]
  )
  
AFAIK, since the output template for the two alternatives is identical, you don't need to specify multiple
identical output templates using '@'. You can just specify the alternative values for the "type" attribute.
See the *sub_shiftsi pattern in the arm backend for an example of that.
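
In other words, the pattern can keep a single output template and vary
only the attribute across alternatives; a sketch of the suggested form
(the updated patch later in this thread takes exactly this shape):

  "ror\\t%<w>0, %<w>1, %<w>2"
  [(set_attr "type" "shift_reg, rotate_imm")]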

arm-wise this patch is ok since you don't actually introduce usage of the new type in any arm patterns.

Thanks,
Kyrill



* Re: [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute
  2015-09-25 13:19   ` Kyrill Tkachov
@ 2015-09-25 13:49     ` James Greenhalgh
  2015-09-25 15:05       ` Marcus Shawcroft
  0 siblings, 1 reply; 11+ messages in thread
From: James Greenhalgh @ 2015-09-25 13:49 UTC (permalink / raw)
  To: gcc-patches
  Cc: kyrtka01, ramana.radhakrishnan, marcus.shawcroft, richard.earnshaw

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]


On Fri, Sep 25, 2015 at 01:53:35PM +0100, Kyrill Tkachov wrote:
> Hi James,
>
> On 25/09/15 08:59, James Greenhalgh wrote:
> > Hi,
> >
> > This patch splits the "shift_imm" type attribute used by AArch64 in
> > two - giving rotate_imm and shift_imm.
> >
> > We then apply this transform across the AArch64 pipeline descriptions
> > which have modelling for shift_imm (cortex-a53, cortex-a57, thunderx).
> > This should give no functional change to these models.
> >
> > Bootstrapped and tested on aarch64-none-linux-gnu, and
> > arm-none-linux-gnueabihf with no issues.
> >
> > OK?
> >
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -3807,13 +3807,15 @@
>
>   ;; Rotate right
>   (define_insn "*ror<mode>3_insn"
> -  [(set (match_operand:GPI 0 "register_operand" "=r")
> -        (rotatert:GPI
> -          (match_operand:GPI 1 "register_operand" "r")
> -          (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "rUs<cmode>")))]
> +  [(set (match_operand:GPI 0 "register_operand" "=r,r")
> +     (rotatert:GPI
> +       (match_operand:GPI 1 "register_operand" "r,r")
> +       (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "r,Us<cmode>")))]
>     ""
> -  "ror\\t%<w>0, %<w>1, %<w>2"
> -  [(set_attr "type" "shift_reg")]
> +  "@
> +   ror\\t%<w>0, %<w>1, %<w>2
> +   ror\\t%<w>0, %<w>1, %<w>2"
> +  [(set_attr "type" "shift_reg, rotate_imm")]
>   )
>
> AFAIK, since the output template for the two alternatives is identical, you
> don't need to specify multiple identical output templates using '@'. You can
> just specify the alternative values for the "type" attribute.
> See the *sub_shiftsi pattern in the arm backend for an example of that.
>
> arm-wise this patch is ok since you don't actually introduce usage of the new
> type in any arm patterns.

Thanks for the review and the tip. Patch updated and attached after
checking that the compiler still builds, and that we still schedule
appropriately.

Thanks,
James

---
2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/arm/types.md (type): Add rotate_imm.
	* config/aarch64/aarch64.md (*ror<mode>3_insn): Split out the
	ROR immediate case.
	(*rorsi3_insn_uxtw): Likewise.
	* config/aarch64/thunderx.md (thunderx_shift): Add rotate_imm.
	* config/arm/cortex-a53.md (cortex_a53_alu_shift): Add rotate_imm.
	* config/arm/cortex-a57.md (cortex_a53_alu): Add rotate_imm.


[-- Attachment #2: 0001-Re-Patch-1-2-AArch64-ARM-Give-AArch64-ROR-Immediate-.patch --]
[-- Type: text/x-patch;  name=0001-Re-Patch-1-2-AArch64-ARM-Give-AArch64-ROR-Immediate-.patch, Size: 3582 bytes --]

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 78b9ae2..090ad65 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3807,13 +3807,13 @@
 
 ;; Rotate right
 (define_insn "*ror<mode>3_insn"
-  [(set (match_operand:GPI 0 "register_operand" "=r")
-        (rotatert:GPI
-          (match_operand:GPI 1 "register_operand" "r")
-          (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "rUs<cmode>")))]
+  [(set (match_operand:GPI 0 "register_operand" "=r,r")
+     (rotatert:GPI
+       (match_operand:GPI 1 "register_operand" "r,r")
+       (match_operand:QI 2 "aarch64_reg_or_shift_imm_<mode>" "r,Us<cmode>")))]
   ""
   "ror\\t%<w>0, %<w>1, %<w>2"
-  [(set_attr "type" "shift_reg")]
+  [(set_attr "type" "shift_reg, rotate_imm")]
 )
 
 ;; zero_extend version of above
@@ -3902,7 +3902,7 @@
   operands[3] = GEN_INT (<sizen> - UINTVAL (operands[2]));
   return "ror\\t%<w>0, %<w>1, %3";
 }
-  [(set_attr "type" "shift_imm")]
+  [(set_attr "type" "rotate_imm")]
 )
 
 ;; zero_extend version of the above
@@ -3916,7 +3916,7 @@
   operands[3] = GEN_INT (32 - UINTVAL (operands[2]));
   return "ror\\t%w0, %w1, %3";
 }
-  [(set_attr "type" "shift_imm")]
+  [(set_attr "type" "rotate_imm")]
 )
 
 (define_insn "*<ANY_EXTEND:optab><GPI:mode>_ashl<SHORT:mode>"
diff --git a/gcc/config/aarch64/thunderx.md b/gcc/config/aarch64/thunderx.md
index cf96368..3dae963 100644
--- a/gcc/config/aarch64/thunderx.md
+++ b/gcc/config/aarch64/thunderx.md
@@ -39,7 +39,7 @@
 
 (define_insn_reservation "thunderx_shift" 1
   (and (eq_attr "tune" "thunderx")
-       (eq_attr "type" "bfm,extend,shift_imm,shift_reg,rbit,rev"))
+       (eq_attr "type" "bfm,extend,rotate_imm,shift_imm,shift_reg,rbit,rev"))
   "thunderx_pipe0 | thunderx_pipe1")
 
 
diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
index db572f6..3fa0625 100644
--- a/gcc/config/arm/cortex-a53.md
+++ b/gcc/config/arm/cortex-a53.md
@@ -76,7 +76,7 @@
                         alu_sreg,alus_sreg,logic_reg,logics_reg,\
                         adc_imm,adcs_imm,adc_reg,adcs_reg,\
                         adr,bfm,csel,clz,rbit,rev,alu_dsp_reg,\
-                        shift_imm,shift_reg,\
+                        rotate_imm,shift_imm,shift_reg,\
                         mov_imm,mov_reg,mvn_imm,mvn_reg,\
                         mrs,multiple,no_insn"))
   "cortex_a53_slot_any")
diff --git a/gcc/config/arm/cortex-a57.md b/gcc/config/arm/cortex-a57.md
index a32c848..d6ce440 100644
--- a/gcc/config/arm/cortex-a57.md
+++ b/gcc/config/arm/cortex-a57.md
@@ -296,7 +296,7 @@
 			alu_sreg,alus_sreg,logic_reg,logics_reg,\
 			adc_imm,adcs_imm,adc_reg,adcs_reg,\
 			adr,bfm,clz,rbit,rev,alu_dsp_reg,\
-			shift_imm,shift_reg,\
+			rotate_imm,shift_imm,shift_reg,\
 			mov_imm,mov_reg,\
 			mvn_imm,mvn_reg,\
 			mrs,multiple,no_insn"))
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index ec609ae..534be74 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -120,6 +120,7 @@
 ;                    final output, thus having no impact on scheduling.
 ; rbit               reverse bits.
 ; rev                reverse bytes.
+; rotate_imm         rotate by immediate.
 ; sdiv               signed division.
 ; shift_imm          simple shift operation (LSL, LSR, ASR, ROR) with an
 ;                    immediate.
@@ -627,6 +628,7 @@
   nop,\
   rbit,\
   rev,\
+  rotate_imm,\
   sdiv,\
   shift_imm,\
   shift_reg,\
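
Concretely, after this change an immediate rotate such as "ror w0, w1, #8"
is typed rotate_imm, while the register form "ror w0, w1, w2" keeps
shift_reg.  A minimal sketch of how a pipeline description can use the
split (the "example" automaton, unit and reservation names are invented
for illustration and the latencies are made up; no real GCC port uses
them):

;; Hypothetical sketch: none of these names exist in any GCC port.
(define_automaton "example")
(define_cpu_unit "example_alu" "example")

;; Immediate rotates can now be modelled separately from (and here,
;; faster than) register-controlled shifts.
(define_insn_reservation "example_rotate_imm" 1
  (and (eq_attr "tune" "examplecpu")
       (eq_attr "type" "rotate_imm"))
  "example_alu")

(define_insn_reservation "example_shift_reg" 2
  (and (eq_attr "tune" "examplecpu")
       (eq_attr "type" "shift_reg"))
  "example_alu")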

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute
  2015-09-25 13:49     ` James Greenhalgh
@ 2015-09-25 15:05       ` Marcus Shawcroft
  0 siblings, 0 replies; 11+ messages in thread
From: Marcus Shawcroft @ 2015-09-25 15:05 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: gcc-patches, kyrtka01, ramana.radhakrishnan, Marcus Shawcroft,
	Richard Earnshaw

On 25 September 2015 at 14:19, James Greenhalgh
<james.greenhalgh@arm.com> wrote:

> 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/arm/types.md (type): Add rotate_imm.
>         * config/aarch64/aarch64.md (*ror<mode>3_insn): Split out the
>         ROR immediate case.
>         (*rorsi3_insn_uxtw): Likewise.
>         * config/aarch64/thunderx.md (thunderx_shift): Add rotate_imm.
>         * config/arm/cortex-a53.md (cortex_a53_alu_shift): Add rotate_imm.
>         * config/arm/cortex-a57.md (cortex_a53_alu): Add rotate_imm.

OK with me.  /Marcus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-09-25  8:37 ` [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
@ 2015-10-01  8:31   ` Kyrill Tkachov
  2015-10-01  8:32   ` Marcus Shawcroft
  1 sibling, 0 replies; 11+ messages in thread
From: Kyrill Tkachov @ 2015-10-01  8:31 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: Marcus Shawcroft, Richard Earnshaw, Ramana Radhakrishnan


On 25/09/15 08:59, James Greenhalgh wrote:
> Hi,

Hi James,

>
> This patch introduces a new scheduling model for Cortex-A53.
>
> Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
> and checked with a variety of popular benchmarking and microbenchmarking
> suites to show a benefit.
>
> OK?
>
> Thanks,
> James
>
> ---
> 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
>
> 	* config/arm/aarch-common-protos.h
> 	(aarch_accumulator_forwarding): New.
> 	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
> 	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
> 	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.

s/Liekwise/Likewise/

> 	* config/arm/cortex-a53.md: Rewrite.
>

This is ok arm-wise.
Thanks,
Kyrill

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-09-25  8:37 ` [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
  2015-10-01  8:31   ` Kyrill Tkachov
@ 2015-10-01  8:32   ` Marcus Shawcroft
  2015-10-01  9:41     ` James Greenhalgh
  1 sibling, 1 reply; 11+ messages in thread
From: Marcus Shawcroft @ 2015-10-01  8:32 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: Richard Earnshaw, Kyrylo Tkachov, Ramana Radhakrishnan

On 25/09/15 08:59, James Greenhalgh wrote:
>
> Hi,
>
> This patch introduces a new scheduling model for Cortex-A53.
>
> Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
> and checked with a variety of popular benchmarking and microbenchmarking
> suites to show a benefit.
>
> OK?
>
> Thanks,
> James
>
> ---
> 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
>
> 	* config/arm/aarch-common-protos.h
> 	(aarch_accumulator_forwarding): New.
> 	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
> 	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
> 	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
> 	* config/arm/cortex-a53.md: Rewrite.
>

OK aarch64 with Kyrill's comments fixed.
/M

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-10-01  8:32   ` Marcus Shawcroft
@ 2015-10-01  9:41     ` James Greenhalgh
  2015-10-05 10:07       ` Christophe Lyon
  0 siblings, 1 reply; 11+ messages in thread
From: James Greenhalgh @ 2015-10-01  9:41 UTC (permalink / raw)
  To: Marcus Shawcroft
  Cc: gcc-patches, Richard Earnshaw, Kyrylo Tkachov, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

On Thu, Oct 01, 2015 at 09:33:07AM +0100, Marcus Shawcroft wrote:
> On 25/09/15 08:59, James Greenhalgh wrote:
> >
> > Hi,
> >
> > This patch introduces a new scheduling model for Cortex-A53.
> >
> > Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
> > and checked with a variety of popular benchmarking and microbenchmarking
> > suites to show a benefit.
> >
> > OK?
> >
> > Thanks,
> > James
> >
> > ---
> > 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
> >
> > 	* config/arm/aarch-common-protos.h
> > 	(aarch_accumulator_forwarding): New.
> > 	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
> > 	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
> > 	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
> > 	* config/arm/cortex-a53.md: Rewrite.
> >
> 
> OK aarch64 with Kyrill's comments fixed.
> /M

Thanks,

I had to rebase this over Evandro's recent patch adding neon_ldp/neon_ldp_q
types to the old scheduling model. The rebase was trivial to resolve, and
while I was there I also added the neon_stp/neon_stp_q types which were
missing.

I've attached what I ultimately committed as revision 228324. I messed up
fixing the ChangeLog typo before commit, so that is revision 228325.

Thanks,
James


[-- Attachment #2: rebased.diff --]
[-- Type: text/x-diff, Size: 40046 bytes --]

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 228323)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,12 @@
+2015-10-01  James Greenhalgh  <james.greenhalgh@arm.com>
+
+	* config/arm/aarch-common-protos.h
+	(aarch_accumulator_forwarding): New.
+	(aarch_forward_to_shift_is_not_shifted_reg): Likewise.
+	* config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
+	(aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
+	* config/arm/cortex-a53.md: Rewrite.
+
 2015-10-01  Richard Biener  <rguenther@suse.de>
 
 	* gimple-match.h (mprts_hook): Declare.
Index: gcc/config/arm/aarch-common-protos.h
===================================================================
--- gcc/config/arm/aarch-common-protos.h	(revision 228323)
+++ gcc/config/arm/aarch-common-protos.h	(working copy)
@@ -23,7 +23,9 @@
 #ifndef GCC_AARCH_COMMON_PROTOS_H
 #define GCC_AARCH_COMMON_PROTOS_H
 
+extern int aarch_accumulator_forwarding (rtx_insn *, rtx_insn *);
 extern int aarch_crypto_can_dual_issue (rtx_insn *, rtx_insn *);
+extern int aarch_forward_to_shift_is_not_shifted_reg (rtx_insn *, rtx_insn *);
 extern bool aarch_rev16_p (rtx);
 extern bool aarch_rev16_shleft_mask_imm_p (rtx, machine_mode);
 extern bool aarch_rev16_shright_mask_imm_p (rtx, machine_mode);
Index: gcc/config/arm/aarch-common.c
===================================================================
--- gcc/config/arm/aarch-common.c	(revision 228323)
+++ gcc/config/arm/aarch-common.c	(working copy)
@@ -394,6 +394,112 @@
           && !reg_overlap_mentioned_p (result, op1));
 }
 
+/* Return non-zero if the destination of PRODUCER feeds the accumulator
+   operand of an MLA-like operation.  */
+
+int
+aarch_accumulator_forwarding (rtx_insn *producer, rtx_insn *consumer)
+{
+  rtx producer_set = single_set (producer);
+  rtx consumer_set = single_set (consumer);
+
+  /* We are looking for a SET feeding a SET.  */
+  if (!producer_set || !consumer_set)
+    return 0;
+
+  rtx dest = SET_DEST (producer_set);
+  rtx mla = SET_SRC (consumer_set);
+
+  /* We're looking for a register SET.  */
+  if (!REG_P (dest))
+    return 0;
+
+  rtx accumulator;
+
+  /* Strip a zero_extend.  */
+  if (GET_CODE (mla) == ZERO_EXTEND)
+    mla = XEXP (mla, 0);
+
+  switch (GET_CODE (mla))
+    {
+    case PLUS:
+      /* Possibly an MADD.  */
+      if (GET_CODE (XEXP (mla, 0)) == MULT)
+	accumulator = XEXP (mla, 1);
+      else
+	return 0;
+      break;
+    case MINUS:
+      /* Possibly an MSUB.  */
+      if (GET_CODE (XEXP (mla, 1)) == MULT)
+	accumulator = XEXP (mla, 0);
+      else
+	return 0;
+      break;
+    case FMA:
+	{
+	  /* Possibly an FMADD/FMSUB/FNMADD/FNMSUB.  */
+	  if (REG_P (XEXP (mla, 1))
+	      && REG_P (XEXP (mla, 2))
+	      && (REG_P (XEXP (mla, 0))
+		  || GET_CODE (XEXP (mla, 0)) == NEG))
+
+	    {
+	      /* FMADD/FMSUB.  */
+	      accumulator = XEXP (mla, 2);
+	    }
+	  else if (REG_P (XEXP (mla, 1))
+		   && GET_CODE (XEXP (mla, 2)) == NEG
+		   && (REG_P (XEXP (mla, 0))
+		       || GET_CODE (XEXP (mla, 0)) == NEG))
+	    {
+	      /* FNMADD/FNMSUB.  */
+	      accumulator = XEXP (XEXP (mla, 2), 0);
+	    }
+	  else
+	    return 0;
+	  break;
+	}
+      default:
+	/* Not an MLA-like operation.  */
+	return 0;
+    }
+
+  return (REGNO (dest) == REGNO (accumulator));
+}
+
+/* Return nonzero if the CONSUMER instruction is some sort of
+   arithmetic or logic + shift operation, and the register we are
+   writing in PRODUCER is not used in a register shift by register
+   operation.  */
+
+int
+aarch_forward_to_shift_is_not_shifted_reg (rtx_insn *producer,
+					   rtx_insn *consumer)
+{
+  rtx value, op;
+  rtx early_op;
+
+  if (!arm_get_set_operands (producer, consumer, &value, &op))
+    return 0;
+
+  if ((early_op = arm_find_shift_sub_rtx (op)))
+    {
+      if (REG_P (early_op))
+	early_op = op;
+
+      /* Any other canonicalisation of a shift is a shift-by-constant
+	 so we don't care.  */
+      if (GET_CODE (early_op) == ASHIFT)
+	return (!REG_P (XEXP (early_op, 0))
+		|| !REG_P (XEXP (early_op, 1)));
+      else
+	return 1;
+    }
+
+  return 0;
+}
+
 /* Return non-zero if the consumer (a multiply-accumulate instruction)
    has an accumulator dependency on the result of the producer (a
    multiplication instruction) and no other dependency on that result.  */
Index: gcc/config/arm/cortex-a53.md
===================================================================
--- gcc/config/arm/cortex-a53.md	(revision 228323)
+++ gcc/config/arm/cortex-a53.md	(working copy)
@@ -22,345 +22,700 @@
 (define_automaton "cortex_a53")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Functional units.
+;; General-purpose functional units.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; There are two main integer execution pipelines, described as
-;; slot 0 and issue slot 1.
+;; We use slot0 and slot1 to model constraints on which instructions may
+;; dual-issue.
 
 (define_cpu_unit "cortex_a53_slot0" "cortex_a53")
 (define_cpu_unit "cortex_a53_slot1" "cortex_a53")
 
-(define_reservation "cortex_a53_slot_any" "cortex_a53_slot0|cortex_a53_slot1")
-(define_reservation "cortex_a53_single_issue" "cortex_a53_slot0+cortex_a53_slot1")
+(define_reservation "cortex_a53_slot_any"
+		    "cortex_a53_slot0\
+		     |cortex_a53_slot1")
 
-;; The load/store pipeline.  Load/store instructions can dual-issue from
-;; either pipeline, but two load/stores cannot simultaneously issue.
+(define_reservation "cortex_a53_single_issue"
+		    "cortex_a53_slot0\
+		     +cortex_a53_slot1")
 
-(define_cpu_unit "cortex_a53_ls" "cortex_a53")
+;; Used to model load and store pipelines.  Load/store instructions
+;; can dual-issue with other instructions, but two load/stores cannot
+;; simultaneously issue.
 
-;; The store pipeline.  Shared between both execution pipelines.
-
 (define_cpu_unit "cortex_a53_store" "cortex_a53")
+(define_cpu_unit "cortex_a53_load" "cortex_a53")
+(define_cpu_unit "cortex_a53_ls_agen" "cortex_a53")
 
-;; The branch pipeline.  Branches can dual-issue with other instructions
-;; (except when those instructions take multiple cycles to issue).
+;; Used to model a branch pipeline.  Branches can dual-issue with other
+;; instructions (except when those instructions take multiple cycles
+;; to issue).
 
 (define_cpu_unit "cortex_a53_branch" "cortex_a53")
 
-;; The integer divider.
+;; Used to model an integer divide pipeline.
 
 (define_cpu_unit "cortex_a53_idiv" "cortex_a53")
 
-;; The floating-point add pipeline used to model the usage
-;; of the add pipeline by fmac instructions.
+;; Used to model an integer multiply/multiply-accumulate pipeline.
 
-(define_cpu_unit "cortex_a53_fpadd_pipe" "cortex_a53")
+(define_cpu_unit "cortex_a53_imul" "cortex_a53")
 
-;; Floating-point div/sqrt (long latency, out-of-order completion).
+;; Model general structural hazards, for wherever we need them.
 
-(define_cpu_unit "cortex_a53_fp_div_sqrt" "cortex_a53")
+(define_cpu_unit "cortex_a53_hazard" "cortex_a53")
 
-;; The Advanced SIMD pipelines.
-
-(define_cpu_unit "cortex_a53_simd0" "cortex_a53")
-(define_cpu_unit "cortex_a53_simd1" "cortex_a53")
-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; ALU instructions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_alu" 2
+(define_insn_reservation "cortex_a53_shift" 2
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "alu_imm,alus_imm,logic_imm,logics_imm,\
-                        alu_sreg,alus_sreg,logic_reg,logics_reg,\
-                        adc_imm,adcs_imm,adc_reg,adcs_reg,\
-                        adr,bfm,csel,clz,rbit,rev,alu_dsp_reg,\
-                        rotate_imm,shift_imm,shift_reg,\
-                        mov_imm,mov_reg,mvn_imm,mvn_reg,\
-                        mrs,multiple,no_insn"))
+       (eq_attr "type" "adr,shift_imm,shift_reg,mov_imm,mvn_imm"))
   "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_alu_shift" 2
+(define_insn_reservation "cortex_a53_alu_rotate_imm" 2
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "alu_shift_imm,alus_shift_imm,\
-                        crc,logic_shift_imm,logics_shift_imm,\
-                        alu_ext,alus_ext,alu_shift_reg,alus_shift_reg,\
-                        logic_shift_reg,logics_shift_reg,\
-                        extend,mov_shift,mov_shift_reg,\
-                        mvn_shift,mvn_shift_reg"))
+       (eq_attr "type" "rotate_imm"))
+  "(cortex_a53_slot1)
+   | (cortex_a53_single_issue)")
+
+(define_insn_reservation "cortex_a53_alu" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_imm,alus_imm,logic_imm,logics_imm,
+			alu_sreg,alus_sreg,logic_reg,logics_reg,
+			adc_imm,adcs_imm,adc_reg,adcs_reg,
+			bfm,csel,clz,rbit,rev,alu_dsp_reg,
+			mov_reg,mvn_reg,
+			mrs,multiple,no_insn"))
   "cortex_a53_slot_any")
 
-;; Forwarding path for unshifted operands.
+(define_insn_reservation "cortex_a53_alu_shift" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_shift_imm,alus_shift_imm,
+			crc,logic_shift_imm,logics_shift_imm,
+			alu_ext,alus_ext,
+			extend,mov_shift,mvn_shift"))
+  "cortex_a53_slot_any")
 
-(define_bypass 1 "cortex_a53_alu,cortex_a53_alu_shift"
-  "cortex_a53_alu")
+(define_insn_reservation "cortex_a53_alu_shift_reg" 3
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "type" "alu_shift_reg,alus_shift_reg,
+			logic_shift_reg,logics_shift_reg,
+			mov_shift_reg,mvn_shift_reg"))
+  "cortex_a53_slot_any+cortex_a53_hazard")
 
-(define_bypass 1 "cortex_a53_alu,cortex_a53_alu_shift"
-  "cortex_a53_alu_shift"
-  "arm_no_early_alu_shift_dep")
-
-;; The multiplier pipeline can forward results so there's no need to specify
-;; bypasses. Multiplies can only single-issue currently.
-
 (define_insn_reservation "cortex_a53_mul" 3
   (and (eq_attr "tune" "cortexa53")
        (ior (eq_attr "mul32" "yes")
-            (eq_attr "mul64" "yes")))
-  "cortex_a53_single_issue")
+	    (eq_attr "mul64" "yes")))
+  "cortex_a53_slot_any+cortex_a53_imul")
 
-;; A multiply with a single-register result or an MLA, followed by an
-;; MLA with an accumulator dependency, has its result forwarded so two
-;; such instructions can issue back-to-back.
+;; From the perspective of the GCC scheduling state machine, if we wish to
+;; model an instruction as serialising other instructions, we are best to do
+;; so by modelling it as taking very few cycles.  Scheduling many other
+;; instructions underneath it at the cost of freedom to pick from the
+;; ready list is likely to hurt us more than it helps.  However, we do
+;; want to model some resource and latency cost for divide instructions in
+;; order to avoid divides ending up too lumpy.
 
-(define_bypass 1 "cortex_a53_mul"
-               "cortex_a53_mul"
-               "arm_mac_accumulator_is_mul_result")
-
-;; Punt with a high enough latency for divides.
-(define_insn_reservation "cortex_a53_udiv" 8
+(define_insn_reservation "cortex_a53_div" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "udiv"))
-  "(cortex_a53_slot0+cortex_a53_idiv),cortex_a53_idiv*7")
+       (eq_attr "type" "udiv,sdiv"))
+  "cortex_a53_slot0,cortex_a53_idiv*2")
 
-(define_insn_reservation "cortex_a53_sdiv" 9
-  (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "sdiv"))
-  "(cortex_a53_slot0+cortex_a53_idiv),cortex_a53_idiv*8")
-
-
-(define_bypass 2 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv"
-               "cortex_a53_alu")
-(define_bypass 2 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv"
-               "cortex_a53_alu_shift"
-               "arm_no_early_alu_shift_dep")
-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Load/store instructions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; Address-generation happens in the issue stage.
+;; TODO: load<n> is not prescriptive about how much data is to be loaded.
+;; This is most obvious for LDRD from AArch32 and LDP (X register) from
+;; AArch64, both are tagged load2 but LDP will load 128-bits compared to
+;; LDRD which is 64-bits.
+;;
+;; For the below, we assume AArch64 X-registers for load2, and AArch32
+;; registers for load3/load4.
 
-(define_insn_reservation "cortex_a53_load1" 3
+(define_insn_reservation "cortex_a53_load1" 4
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load_byte,load1,load_acq"))
-  "cortex_a53_slot_any+cortex_a53_ls")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_load")
 
 (define_insn_reservation "cortex_a53_store1" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store1,store_rel"))
-  "cortex_a53_slot_any+cortex_a53_ls+cortex_a53_store")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
 
-(define_insn_reservation "cortex_a53_load2" 3
+;; Model AArch64-sized LDP Xm, Xn, [Xa]
+
+(define_insn_reservation "cortex_a53_load2" 4
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load2"))
-  "cortex_a53_single_issue+cortex_a53_ls")
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
 (define_insn_reservation "cortex_a53_store2" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store2"))
-  "cortex_a53_single_issue+cortex_a53_ls+cortex_a53_store")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
 
-(define_insn_reservation "cortex_a53_load3plus" 4
+;; Model AArch32-sized LDM Ra, {Rm, Rn, Ro}
+
+(define_insn_reservation "cortex_a53_load3plus" 6
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "load3,load4"))
-  "(cortex_a53_single_issue+cortex_a53_ls)*2")
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_store3plus" 3
+(define_insn_reservation "cortex_a53_store3plus" 2
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "store3,store4"))
-  "(cortex_a53_single_issue+cortex_a53_ls+cortex_a53_store)*2")
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store+cortex_a53_slot0,
+   cortex_a53_store")
 
-;; Load/store addresses are required early in Issue.
-(define_bypass 3 "cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus,cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_load*"
-                 "arm_early_load_addr_dep")
-(define_bypass 3 "cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus,cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_store*"
-                 "arm_early_store_addr_dep")
-
-;; Load data can forward in the ALU pipeline
-(define_bypass 2 "cortex_a53_load1,cortex_a53_load2"
-               "cortex_a53_alu")
-(define_bypass 2 "cortex_a53_load1,cortex_a53_load2"
-               "cortex_a53_alu_shift"
-               "arm_no_early_alu_shift_dep")
-
-;; ALU ops can forward to stores.
-(define_bypass 0 "cortex_a53_alu,cortex_a53_alu_shift"
-                 "cortex_a53_store1,cortex_a53_store2,cortex_a53_store3plus"
-                 "arm_no_early_store_addr_dep")
-
-(define_bypass 1 "cortex_a53_mul,cortex_a53_udiv,cortex_a53_sdiv,cortex_a53_load1,cortex_a53_load2,cortex_a53_load3plus"
-                 "cortex_a53_store1,cortex_a53_store2,cortex_a53_store3plus"
-                 "arm_no_early_store_addr_dep")
-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Branches.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; Currently models all branches as dual-issuable from either execution
-;; slot, which isn't true for all cases. We still need to model indirect
-;; branches.
+;; Model all branches as dual-issuable from either execution slot,
+;; which is not strictly true for all cases (indirect branches).
 
 (define_insn_reservation "cortex_a53_branch" 0
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "branch,call"))
-  "cortex_a53_slot_any+cortex_a53_branch")
+  "cortex_a53_slot_any,cortex_a53_branch")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; General-purpose register bypasses
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;; Model bypasses for unshifted operands to ALU instructions.
+
+(define_bypass 1 "cortex_a53_shift"
+		 "cortex_a53_shift")
+
+(define_bypass 1 "cortex_a53_alu,
+		  cortex_a53_alu_shift*,
+		  cortex_a53_alu_rotate_imm,
+		  cortex_a53_shift"
+		 "cortex_a53_alu")
+
+(define_bypass 2 "cortex_a53_alu,
+		  cortex_a53_alu_shift*"
+		 "cortex_a53_alu_shift*"
+		 "aarch_forward_to_shift_is_not_shifted_reg")
+
+;; In our model, we allow any general-purpose register operation to
+;; bypass to the accumulator operand of an integer MADD-like operation.
+
+(define_bypass 1 "cortex_a53_alu*,
+		  cortex_a53_load*,
+		  cortex_a53_mul"
+		 "cortex_a53_mul"
+		 "aarch_accumulator_forwarding")
+
+;; Model a bypass from MLA/MUL to many ALU instructions.
+
+(define_bypass 2 "cortex_a53_mul"
+		 "cortex_a53_alu,
+		  cortex_a53_alu_shift*")
+
+;; We get neater schedules by allowing an MLA/MUL to feed an
+;; early load address dependency to a load.
+
+(define_bypass 2 "cortex_a53_mul"
+		 "cortex_a53_load*"
+		 "arm_early_load_addr_dep")
+
+;; Model bypasses for loads which are to be consumed by the ALU.
+
+(define_bypass 2 "cortex_a53_load1"
+		 "cortex_a53_alu")
+
+(define_bypass 3 "cortex_a53_load1"
+		 "cortex_a53_alu_shift*")
+
+;; Model a bypass for ALU instructions feeding stores.
+
+(define_bypass 1 "cortex_a53_alu*"
+		 "cortex_a53_store1,
+		  cortex_a53_store2,
+		  cortex_a53_store3plus"
+		 "arm_no_early_store_addr_dep")
+
+;; Model a bypass for load and multiply instructions feeding stores.
+
+(define_bypass 2 "cortex_a53_mul,
+		  cortex_a53_load1,
+		  cortex_a53_load2,
+		  cortex_a53_load3plus"
+		 "cortex_a53_store1,
+		  cortex_a53_store2,
+		  cortex_a53_store3plus"
+		 "arm_no_early_store_addr_dep")
+
+;; Model a GP->FP register move as similar to stores.
+
+(define_bypass 1 "cortex_a53_alu*"
+		 "cortex_a53_r2f")
+
+(define_bypass 2 "cortex_a53_mul,
+		  cortex_a53_load1,
+		  cortex_a53_load2,
+		  cortex_a53_load3plus"
+		 "cortex_a53_r2f")
+
+;; Shifts feeding Load/Store addresses may not be ready in time.
+
+(define_bypass 3 "cortex_a53_shift"
+		 "cortex_a53_load*"
+		 "arm_early_load_addr_dep")
+
+(define_bypass 3 "cortex_a53_shift"
+		 "cortex_a53_store*"
+		 "arm_early_store_addr_dep")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define_automaton "cortex_a53_advsimd")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Broad Advanced SIMD type categorisation
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define_attr "cortex_a53_advsimd_type"
+  "advsimd_alu, advsimd_alu_q,
+   advsimd_mul, advsimd_mul_q,
+   advsimd_div_s, advsimd_div_s_q,
+   advsimd_div_d, advsimd_div_d_q,
+   advsimd_load_64, advsimd_store_64,
+   advsimd_load_128, advsimd_store_128,
+   advsimd_load_lots, advsimd_store_lots,
+   unknown"
+  (cond [
+    (eq_attr "type" "neon_add, neon_qadd, neon_add_halve, neon_sub, neon_qsub,\
+		     neon_sub_halve, neon_abs, neon_neg, neon_qneg,\
+		     neon_qabs, neon_abd, neon_minmax, neon_compare,\
+		     neon_compare_zero, neon_arith_acc, neon_reduc_add,\
+		     neon_reduc_add_acc, neon_reduc_minmax,\
+		     neon_logic, neon_tst, neon_shift_imm,\
+		     neon_shift_reg, neon_shift_acc, neon_sat_shift_imm,\
+		     neon_sat_shift_reg, neon_ins, neon_move,\
+		     neon_permute, neon_zip, neon_tbl1,\
+		     neon_tbl2, neon_tbl3, neon_tbl4, neon_bsl,\
+		     neon_cls, neon_cnt, neon_dup,\
+		     neon_ext, neon_rbit, neon_rev,\
+		     neon_fp_abd_s, neon_fp_abd_d,\
+		     neon_fp_abs_s, neon_fp_abs_d,\
+		     neon_fp_addsub_s, neon_fp_addsub_d, neon_fp_compare_s,\
+		     neon_fp_compare_d, neon_fp_minmax_s,\
+		     neon_fp_minmax_d, neon_fp_neg_s, neon_fp_neg_d,\
+		     neon_fp_reduc_add_s, neon_fp_reduc_add_d,\
+		     neon_fp_reduc_minmax_s, neon_fp_reduc_minmax_d,\
+		     neon_fp_cvt_widen_h, neon_fp_to_int_s,neon_fp_to_int_d,\
+		     neon_int_to_fp_s, neon_int_to_fp_d, neon_fp_round_s,\
+		     neon_fp_recpe_s, neon_fp_recpe_d, neon_fp_recps_s,\
+		     neon_fp_recps_d, neon_fp_recpx_s, neon_fp_recpx_d,\
+		     neon_fp_rsqrte_s, neon_fp_rsqrte_d, neon_fp_rsqrts_s,\
+		     neon_fp_rsqrts_d")
+      (const_string "advsimd_alu")
+    (eq_attr "type" "neon_add_q, neon_add_widen, neon_add_long,\
+		     neon_qadd_q, neon_add_halve_q, neon_add_halve_narrow_q,\
+		     neon_sub_q, neon_sub_widen, neon_sub_long,\
+		     neon_qsub_q, neon_sub_halve_q, neon_sub_halve_narrow_q,\
+		     neon_abs_q, neon_neg_q, neon_qneg_q, neon_qabs_q,\
+		     neon_abd_q, neon_abd_long, neon_minmax_q,\
+		     neon_compare_q, neon_compare_zero_q,\
+		     neon_arith_acc_q, neon_reduc_add_q,\
+		     neon_reduc_add_long, neon_reduc_add_acc_q,\
+		     neon_reduc_minmax_q, neon_logic_q, neon_tst_q,\
+		     neon_shift_imm_q, neon_shift_imm_narrow_q,\
+		     neon_shift_imm_long, neon_shift_reg_q,\
+		     neon_shift_acc_q, neon_sat_shift_imm_q,\
+		     neon_sat_shift_imm_narrow_q, neon_sat_shift_reg_q,\
+		     neon_ins_q, neon_move_q, neon_move_narrow_q,\
+		     neon_permute_q, neon_zip_q,\
+		     neon_tbl1_q, neon_tbl2_q, neon_tbl3_q,\
+		     neon_tbl4_q, neon_bsl_q, neon_cls_q, neon_cnt_q,\
+		     neon_dup_q, neon_ext_q, neon_rbit_q,\
+		     neon_rev_q, neon_fp_abd_s_q, neon_fp_abd_d_q,\
+		     neon_fp_abs_s_q, neon_fp_abs_d_q,\
+		     neon_fp_addsub_s_q, neon_fp_addsub_d_q,\
+		     neon_fp_compare_s_q, neon_fp_compare_d_q,\
+		     neon_fp_minmax_s_q, neon_fp_minmax_d_q,\
+		     neon_fp_cvt_widen_s, neon_fp_neg_s_q, neon_fp_neg_d_q,\
+		     neon_fp_reduc_add_s_q, neon_fp_reduc_add_d_q,\
+		     neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d_q,\
+		     neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
+		     neon_fp_to_int_s_q, neon_fp_to_int_d_q,\
+		     neon_int_to_fp_s_q, neon_int_to_fp_d_q,\
+		     neon_fp_round_s_q,\
+		     neon_fp_recpe_s_q, neon_fp_recpe_d_q,\
+		     neon_fp_recps_s_q, neon_fp_recps_d_q,\
+		     neon_fp_recpx_s_q, neon_fp_recpx_d_q,\
+		     neon_fp_rsqrte_s_q, neon_fp_rsqrte_d_q,\
+		     neon_fp_rsqrts_s_q, neon_fp_rsqrts_d_q")
+      (const_string "advsimd_alu_q")
+    (eq_attr "type" "neon_mul_b, neon_mul_h, neon_mul_s,\
+		     neon_mul_h_scalar, neon_mul_s_scalar,\
+		     neon_sat_mul_b, neon_sat_mul_h, neon_sat_mul_s,\
+		     neon_sat_mul_h_scalar, neon_sat_mul_s_scalar,\
+		     neon_mla_b, neon_mla_h, neon_mla_s,\
+		     neon_mla_h_scalar, neon_mla_s_scalar,\
+		     neon_fp_mul_s, neon_fp_mul_s_scalar,\
+		     neon_fp_mul_d, neon_fp_mla_s,\
+		     neon_fp_mla_s_scalar, neon_fp_mla_d")
+      (const_string "advsimd_mul")
+    (eq_attr "type" "neon_mul_b_q, neon_mul_h_q, neon_mul_s_q,\
+		     neon_mul_b_long, neon_mul_h_long, neon_mul_s_long,\
+		     neon_mul_d_long, neon_mul_h_scalar_q,\
+		     neon_mul_s_scalar_q, neon_mul_h_scalar_long,\
+		     neon_mul_s_scalar_long, neon_sat_mul_b_q,\
+		     neon_sat_mul_h_q, neon_sat_mul_s_q,\
+		     neon_sat_mul_b_long, neon_sat_mul_h_long,\
+		     neon_sat_mul_s_long, neon_sat_mul_h_scalar_q,\
+		     neon_sat_mul_s_scalar_q, neon_sat_mul_h_scalar_long,\
+		     neon_sat_mul_s_scalar_long, neon_mla_b_q,\
+		     neon_mla_h_q, neon_mla_s_q, neon_mla_b_long,\
+		     neon_mla_h_long, neon_mla_s_long,\
+		     neon_mla_h_scalar_q, neon_mla_s_scalar_q,\
+		     neon_mla_h_scalar_long, neon_mla_s_scalar_long,\
+		     neon_sat_mla_b_long, neon_sat_mla_h_long,\
+		     neon_sat_mla_s_long, neon_sat_mla_h_scalar_long,\
+		     neon_sat_mla_s_scalar_long,\
+		     neon_fp_mul_s_q, neon_fp_mul_s_scalar_q,\
+		     neon_fp_mul_d_q, neon_fp_mul_d_scalar_q,\
+		     neon_fp_mla_s_q, neon_fp_mla_s_scalar_q,\
+		     neon_fp_mla_d_q, neon_fp_mla_d_scalar_q")
+      (const_string "advsimd_mul_q")
+    (eq_attr "type" "neon_fp_sqrt_s, neon_fp_div_s")
+      (const_string "advsimd_div_s")
+    (eq_attr "type" "neon_fp_sqrt_s_q, neon_fp_div_s_q")
+      (const_string "advsimd_div_s_q")
+    (eq_attr "type" "neon_fp_sqrt_d, neon_fp_div_d")
+      (const_string "advsimd_div_d")
+    (eq_attr "type" "neon_fp_sqrt_d_q, neon_fp_div_d_q")
+      (const_string "advsimd_div_d_q")
+    (eq_attr "type" "neon_ldr, neon_load1_1reg,\
+		     neon_load1_all_lanes, neon_load1_all_lanes_q,\
+		     neon_load1_one_lane, neon_load1_one_lane_q")
+      (const_string "advsimd_load_64")
+    (eq_attr "type" "neon_str, neon_store1_1reg,\
+		     neon_store1_one_lane,neon_store1_one_lane_q")
+      (const_string "advsimd_store_64")
+    (eq_attr "type" "neon_load1_1reg_q, neon_load1_2reg,\
+		     neon_load2_2reg,\
+		     neon_load2_all_lanes, neon_load2_all_lanes_q,\
+		     neon_load2_one_lane, neon_load2_one_lane_q")
+      (const_string "advsimd_load_128")
+    (eq_attr "type" "neon_store1_1reg_q, neon_store1_2reg,\
+		     neon_store2_2reg,\
+		     neon_store2_one_lane, neon_store2_one_lane_q")
+      (const_string "advsimd_store_128")
+    (eq_attr "type" "neon_load1_2reg_q, neon_load1_3reg, neon_load1_3reg_q,\
+		     neon_load1_4reg, neon_load1_4reg_q, \
+		     neon_load2_2reg_q, neon_load2_4reg,\
+		     neon_load2_4reg_q, neon_load3_3reg,\
+		     neon_load3_3reg_q, neon_load3_all_lanes,\
+		     neon_load3_all_lanes_q, neon_load3_one_lane,\
+		     neon_load3_one_lane_q, neon_load4_4reg,\
+		     neon_load4_4reg_q, neon_load4_all_lanes,\
+		     neon_load4_all_lanes_q, neon_load4_one_lane,\
+		     neon_load4_one_lane_q, neon_ldp, neon_ldp_q")
+      (const_string "advsimd_load_lots")
+    (eq_attr "type" "neon_store1_2reg_q, neon_store1_3reg,\
+		     neon_store1_3reg_q, neon_store1_4reg,\
+		     neon_store1_4reg_q, neon_store2_2reg_q,\
+		     neon_store2_4reg, neon_store2_4reg_q,\
+		     neon_store3_3reg, neon_store3_3reg_q,\
+		     neon_store3_one_lane, neon_store3_one_lane_q,\
+		     neon_store4_4reg, neon_store4_4reg_q,\
+		     neon_store4_one_lane, neon_store4_one_lane_q,\
+		     neon_stp, neon_stp_q")
+      (const_string "advsimd_store_lots")]
+      (const_string "unknown")))
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD functional units.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;; We model the Advanced SIMD unit as two 64-bit units, each with three
+;; pipes, FP_ALU, FP_MUL, FP_DIV.  We also give convenient reservations
+;; for 128-bit Advanced SIMD instructions, which use both units.
+
+;; The floating-point/Advanced SIMD ALU pipelines.
+
+(define_cpu_unit "cortex_a53_fp_alu_lo,\
+		  cortex_a53_fp_alu_hi"
+		 "cortex_a53_advsimd")
+
+(define_reservation "cortex_a53_fp_alu"
+		    "cortex_a53_fp_alu_lo\
+		     |cortex_a53_fp_alu_hi")
+
+(define_reservation "cortex_a53_fp_alu_q"
+		    "cortex_a53_fp_alu_lo\
+		     +cortex_a53_fp_alu_hi")
+
+;; The floating-point/Advanced SIMD multiply/multiply-accumulate
+;; pipelines.
+
+(define_cpu_unit "cortex_a53_fp_mul_lo,\
+		  cortex_a53_fp_mul_hi"
+		 "cortex_a53_advsimd")
+
+(define_reservation "cortex_a53_fp_mul"
+		    "cortex_a53_fp_mul_lo\
+		     |cortex_a53_fp_mul_hi")
+
+(define_reservation "cortex_a53_fp_mul_q"
+		    "cortex_a53_fp_mul_lo\
+		     +cortex_a53_fp_mul_hi")
+
+;; Floating-point/Advanced SIMD divide/square root.
+
+(define_cpu_unit "cortex_a53_fp_div_lo,\
+		  cortex_a53_fp_div_hi"
+		 "cortex_a53_advsimd")
+
+;; Once we choose a pipe, stick with it for three simulated cycles.
+
+(define_reservation "cortex_a53_fp_div"
+		    "(cortex_a53_fp_div_lo*3)\
+		     |(cortex_a53_fp_div_hi*3)")
+
+(define_reservation "cortex_a53_fp_div_q"
+		    "(cortex_a53_fp_div_lo*3)\
+		     +(cortex_a53_fp_div_hi*3)")
+
+;; Cryptographic extensions
+
+(define_cpu_unit "cortex_a53_crypto"
+		 "cortex_a53_advsimd")
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Floating-point arithmetic.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_fpalu" 4
+(define_insn_reservation "cortex_a53_fpalu" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "ffariths, fadds, ffarithd, faddd, fmov, fmuls,\
-                        f_cvt,f_cvtf2i,f_cvti2f,\
-                        fcmps, fcmpd, fcsel, f_rints, f_rintd, f_minmaxs,\
-                        f_minmaxd"))
-  "cortex_a53_slot0+cortex_a53_fpadd_pipe")
+	(eq_attr "type" "ffariths, fadds, ffarithd, faddd, fmov,
+			f_cvt, fcmps, fcmpd, fcsel, f_rints, f_rintd,
+			f_minmaxs, f_minmaxd"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fconst" 2
+(define_insn_reservation "cortex_a53_fconst" 3
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fconsts,fconstd"))
-  "cortex_a53_slot0+cortex_a53_fpadd_pipe")
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fpmul" 4
+(define_insn_reservation "cortex_a53_fpmul" 5
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fmuls,fmuld"))
-  "cortex_a53_slot0")
+  "cortex_a53_slot_any,cortex_a53_fp_mul")
 
-;; For single-precision multiply-accumulate, the add (accumulate) is issued after
-;; the multiply completes. Model that accordingly.
+;; For multiply-accumulate, model the add (accumulate) as being issued
+;; after the multiply completes.
 
 (define_insn_reservation "cortex_a53_fpmac" 8
   (and (eq_attr "tune" "cortexa53")
        (eq_attr "type" "fmacs,fmacd,ffmas,ffmad"))
-  "cortex_a53_slot0, nothing*3, cortex_a53_fpadd_pipe")
+  "cortex_a53_slot_any,cortex_a53_fp_mul,
+   nothing*3, cortex_a53_fp_alu")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Floating-point divide/square root instructions.
+;; Floating-point to/from core transfers.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; fsqrt really takes one cycle less, but that is not modelled.
 
-(define_insn_reservation "cortex_a53_fdivs" 14
+(define_insn_reservation "cortex_a53_r2f" 6
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "fdivs, fsqrts"))
-  "cortex_a53_slot0, cortex_a53_fp_div_sqrt * 5")
+       (eq_attr "type" "f_mcr,f_mcrr,f_cvti2f,
+			neon_from_gp, neon_from_gp_q"))
+  "cortex_a53_slot_any,cortex_a53_store,
+   nothing,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_fdivd" 29
+(define_insn_reservation "cortex_a53_f2r" 6
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "fdivd, fsqrtd"))
-  "cortex_a53_slot0, cortex_a53_fp_div_sqrt * 8")
+       (eq_attr "type" "f_mrc,f_mrrc,f_cvtf2i,
+			neon_to_gp, neon_to_gp_q"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu,
+   nothing,cortex_a53_store")
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; ARMv8-A Cryptographic extensions.
+;; Floating-point flag transfer.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_crypto_aese" 2
+(define_insn_reservation "cortex_a53_f_flags" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_aese"))
-  "cortex_a53_simd0")
+       (eq_attr "type" "f_flag"))
+  "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_crypto_aesmc" 2
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point load/store.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(define_insn_reservation "cortex_a53_f_load_64" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_aesmc"))
-  "cortex_a53_simd0 | cortex_a53_simd1")
+       (ior (eq_attr "type" "f_loads,f_loadd")
+	    (eq_attr "cortex_a53_advsimd_type"
+		     "advsimd_load_64")))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_crypto_sha1_fast" 2
+(define_insn_reservation "cortex_a53_f_load_many" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_fast, crypto_sha256_fast"))
-  "cortex_a53_simd0")
+       (eq_attr "cortex_a53_advsimd_type"
+		"advsimd_load_128,advsimd_load_lots"))
+  "cortex_a53_single_issue+cortex_a53_ls_agen,
+   cortex_a53_load+cortex_a53_slot0,
+   cortex_a53_load")
 
-(define_insn_reservation "cortex_a53_crypto_sha1_xor" 3
+(define_insn_reservation "cortex_a53_f_store_64" 0
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_xor"))
-  "cortex_a53_simd0")
+       (ior (eq_attr "type" "f_stores,f_stored")
+	    (eq_attr "cortex_a53_advsimd_type"
+		     "advsimd_store_64")))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store")
 
-(define_insn_reservation "cortex_a53_crypto_sha_slow" 5
+(define_insn_reservation "cortex_a53_f_store_many" 0
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "crypto_sha1_slow, crypto_sha256_slow"))
-  "cortex_a53_simd0")
+       (eq_attr "cortex_a53_advsimd_type"
+		"advsimd_store_128,advsimd_store_lots"))
+  "cortex_a53_slot_any+cortex_a53_ls_agen,
+   cortex_a53_store+cortex_a53_slot0,
+   cortex_a53_store")
 
-(define_bypass 0 "cortex_a53_crypto_aese"
-                 "cortex_a53_crypto_aesmc"
-                 "aarch_crypto_can_dual_issue")
-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP to/from core transfers.
+;; Advanced SIMD.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_r2f" 4
+;; Either we want to model use of the ALU pipe, the multiply pipe or the
+;; divide/sqrt pipe.  In all cases we need to check if we are a 64-bit
+;; operation (in which case we model dual-issue without penalty)
+;; or a 128-bit operation in which case we require in our model that we
+;; issue from slot 0.
+
+(define_insn_reservation "cortex_a53_advsimd_alu" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_mcr,f_mcrr"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_alu"))
+  "cortex_a53_slot_any,cortex_a53_fp_alu")
 
-(define_insn_reservation "cortex_a53_f2r" 2
+(define_insn_reservation "cortex_a53_advsimd_alu_q" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_mrc,f_mrrc"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_alu_q"))
+  "cortex_a53_slot0,cortex_a53_fp_alu_q")
 
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP flag transfer.
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+(define_insn_reservation "cortex_a53_advsimd_mul" 5
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_mul"))
+  "cortex_a53_slot_any,cortex_a53_fp_mul")
 
-(define_insn_reservation "cortex_a53_f_flags" 4
+(define_insn_reservation "cortex_a53_advsimd_mul_q" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_flag"))
-  "cortex_a53_slot0")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_mul_q"))
+  "cortex_a53_slot0,cortex_a53_fp_mul_q")
 
+;; SIMD Dividers.
+
+(define_insn_reservation "cortex_a53_advsimd_div_s" 14
+  (and (eq_attr "tune" "cortexa53")
+       (ior (eq_attr "type" "fdivs,fsqrts")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_s")))
+  "cortex_a53_slot0,cortex_a53_fp_mul,
+   cortex_a53_fp_div")
+
+(define_insn_reservation "cortex_a53_advsimd_div_d" 29
+  (and (eq_attr "tune" "cortexa53")
+       (ior (eq_attr "type" "fdivd,fsqrtd")
+	    (eq_attr "cortex_a53_advsimd_type" "advsimd_div_d")))
+  "cortex_a53_slot0,cortex_a53_fp_mul,
+   cortex_a53_fp_div")
+
+(define_insn_reservation "cortex_a53_advsimd_div_s_q" 14
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_s_q"))
+  "cortex_a53_single_issue,cortex_a53_fp_mul_q,
+   cortex_a53_fp_div_q")
+
+(define_insn_reservation "cortex_a53_advsimd_divd_q" 29
+  (and (eq_attr "tune" "cortexa53")
+       (eq_attr "cortex_a53_advsimd_type" "advsimd_div_d_q"))
+  "cortex_a53_single_issue,cortex_a53_fp_mul_q,
+   cortex_a53_fp_div_q")
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; VFP load/store.
+;; ARMv8-A Cryptographic extensions.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_insn_reservation "cortex_a53_f_loads" 4
+;; We want AESE and AESMC to end up consecutive to one another.
+
+(define_insn_reservation "cortex_a53_crypto_aese" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_loads"))
+       (eq_attr "type" "crypto_aese"))
   "cortex_a53_slot0")
 
-(define_insn_reservation "cortex_a53_f_loadd" 5
+(define_insn_reservation "cortex_a53_crypto_aesmc" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_loadd"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_aesmc"))
+  "cortex_a53_slot_any")
 
-(define_insn_reservation "cortex_a53_f_load_2reg" 5
+;; SHA1H
+
+(define_insn_reservation "cortex_a53_crypto_sha1_fast" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "neon_ldp, neon_ldp_q, neon_load2_2reg_q"))
-  "(cortex_a53_slot_any+cortex_a53_ls)*2")
+       (eq_attr "type" "crypto_sha1_fast"))
+  "cortex_a53_slot_any,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_loadq" 5
+(define_insn_reservation "cortex_a53_crypto_sha256_fast" 3
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "neon_load1_1reg_q"))
-  "cortex_a53_slot_any+cortex_a53_ls")
+       (eq_attr "type" "crypto_sha256_fast"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_stores" 0
+(define_insn_reservation "cortex_a53_crypto_sha1_xor" 4
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_stores"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_sha1_xor"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-(define_insn_reservation "cortex_a53_f_stored" 0
+(define_insn_reservation "cortex_a53_crypto_sha_slow" 5
   (and (eq_attr "tune" "cortexa53")
-       (eq_attr "type" "f_stored"))
-  "cortex_a53_slot0")
+       (eq_attr "type" "crypto_sha1_slow, crypto_sha256_slow"))
+  "cortex_a53_slot0,cortex_a53_crypto")
 
-;; Load-to-use for floating-point values has a penalty of one cycle,
-;; i.e. a latency of two.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Floating-point/Advanced SIMD register bypasses.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-(define_bypass 2 "cortex_a53_f_loads"
-                 "cortex_a53_fpalu, cortex_a53_fpmac, cortex_a53_fpmul,\
-		  cortex_a53_fdivs, cortex_a53_fdivd,\
-		  cortex_a53_f2r")
+;; Model the late use of the accumulator operand for floating-point
+;; multiply-accumulate operations as a bypass reducing the latency
+;; of producing instructions to near zero.
 
-(define_bypass 2 "cortex_a53_f_loadd"
-                 "cortex_a53_fpalu, cortex_a53_fpmac, cortex_a53_fpmul,\
-		  cortex_a53_fdivs, cortex_a53_fdivd,\
-		  cortex_a53_f2r")
+(define_bypass 1 "cortex_a53_fp*,
+		  cortex_a53_r2f,
+		  cortex_a53_f_load*"
+		 "cortex_a53_fpmac"
+		 "aarch_accumulator_forwarding")
 
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;; Crude Advanced SIMD approximation.
-;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Model a bypass from the result of an FP operation to a use.
 
-(define_insn_reservation "cortex_a53_advsimd" 4
-  (and (eq_attr "tune" "cortexa53")
-       (eq_attr "is_neon_type" "yes"))
-  "cortex_a53_simd0")
+(define_bypass 4 "cortex_a53_fpalu,
+		  cortex_a53_fpmul"
+		 "cortex_a53_fpalu,
+		  cortex_a53_fpmul,
+		  cortex_a53_fpmac,
+		  cortex_a53_advsimd_div*")
+
+;; We want AESE and AESMC to end up consecutive to one another.
+
+(define_bypass 0 "cortex_a53_crypto_aese"
+		 "cortex_a53_crypto_aesmc"
+		 "aarch_crypto_can_dual_issue")
+
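
To make the new aarch_accumulator_forwarding guard concrete: it accepts
a producer that is a single SET of a register feeding the accumulator
operand of an MLA-like consumer.  A sketch of one qualifying pair, with
illustrative AArch64 registers (the instructions and register numbers
are examples, not taken from the patch):

;; Producer, a single SET of a register:
;;   mul  x0, x1, x2      (set (reg x0) (mult (reg x1) (reg x2)))
;; Consumer, an MADD whose accumulator operand is that same register:
;;   madd x3, x4, x5, x0
;;      (set (reg x3) (plus (mult (reg x4) (reg x5)) (reg x0)))
;; The REGNO of the producer's destination matches the REGNO of the
;; accumulator operand of the PLUS, so the guard returns 1 and the
;; 1-cycle define_bypass above applies in place of the full 3-cycle
;; cortex_a53_mul latency.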
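
Likewise for aarch_forward_to_shift_is_not_shifted_reg: the 2-cycle
bypass into the ALU-with-shift reservations stays available unless the
consumer shifts a register by a register amount, in which case no
bypass fires and the full latency applies.  Two illustrative pairs
(AArch32 syntax for the register-shift case; register choices are
arbitrary):

;; add x0, x1, x2            producer
;; add x3, x4, x0, lsl #2    consumer shifting by a constant: the
;;                           shift amount is not a REG, the guard
;;                           returns 1, the 2-cycle bypass applies.
;;
;; add r0, r1, r2            producer
;; add r3, r4, r0, lsl r5    consumer shifting a register by a
;;                           register: both ASHIFT operands are REGs,
;;                           the guard returns 0, and the full 3-cycle
;;                           cortex_a53_alu_shift_reg latency is kept.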

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-10-01  9:41     ` James Greenhalgh
@ 2015-10-05 10:07       ` Christophe Lyon
  2015-10-05 10:36         ` James Greenhalgh
  0 siblings, 1 reply; 11+ messages in thread
From: Christophe Lyon @ 2015-10-05 10:07 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Marcus Shawcroft, gcc-patches, Richard Earnshaw, Kyrylo Tkachov,
	Ramana Radhakrishnan

On 1 October 2015 at 11:41, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Thu, Oct 01, 2015 at 09:33:07AM +0100, Marcus Shawcroft wrote:
>> On 25/09/15 08:59, James Greenhalgh wrote:
>> >
>> > Hi,
>> >
>> > This patch introduces a new scheduling model for Cortex-A53.
>> >
>> > Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
>> > and checked with a variety of popular benchmarking and microbenchmarking
>> > suites to show a benefit.
>> >
>> > OK?
>> >
>> > Thanks,
>> > James
>> >
>> > ---
>> > 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
>> >
>> >     * config/arm/aarch-common-protos.h
>> >     (aarch_accumulator_forwarding): New.
>> >     (aarch_forward_to_shift_is_not_shifted_reg): Likewise.
>> >     * config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
>> >     (aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
>> >     * config/arm/cortex-a53.md: Rewrite.
>> >
>>
>> OK aarch64 with Kyrill's comments fixed.
>> /M
>
> Thanks,
>
> I had to rebase this over Evandro's recent patch adding neon_ldp/neon_ldp_q
> types to the old scheduling model. The rebase was trivial to resolve, and
> while I was there I also added the neon_stp/neon_stp_q types which were
> missing.
>
> I've attached what I ultimately committed as revision 228324. I messed up
> fixing the ChangeLog typo before commit, so that is revision 228325.
>

Hi James,

Since this commit I can see
gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c fail at -O2
/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c:
In function 'exec_vst1_lane':
/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c:96:1:
internal compiler error: output_operand: invalid %-code
0x78f79e output_operand_lossage(char const*, ...)
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3417
0x7934f3 output_asm_insn(char const*, rtx_def**)
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3782
0x793d77 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3029
0x794b3a final(rtx_insn*, _IO_FILE*, int)
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:2058
0x7956fb rest_of_handle_final
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:4449
0x7956fb execute
        /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:4524
Please submit a full bug report,

on aarch64_be.

I haven't looked at it in more detail, though.


> Thanks,
> James
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model
  2015-10-05 10:07       ` Christophe Lyon
@ 2015-10-05 10:36         ` James Greenhalgh
  0 siblings, 0 replies; 11+ messages in thread
From: James Greenhalgh @ 2015-10-05 10:36 UTC (permalink / raw)
  To: Christophe Lyon
  Cc: Marcus Shawcroft, gcc-patches, Richard Earnshaw, Kyrylo Tkachov,
	Ramana Radhakrishnan, kugan.vivekanandarajah

On Mon, Oct 05, 2015 at 11:07:45AM +0100, Christophe Lyon wrote:
> On 1 October 2015 at 11:41, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> > On Thu, Oct 01, 2015 at 09:33:07AM +0100, Marcus Shawcroft wrote:
> >> On 25/09/15 08:59, James Greenhalgh wrote:
> >> >
> >> > Hi,
> >> >
> >> > This patch introduces a new scheduling model for Cortex-A53.
> >> >
> >> > Bootstrapped and tested on arm-none-linux-gnueabi and aarch64-none-linux-gnu
> >> > and checked with a variety of popular benchmarking and microbenchmarking
> >> > suites to show a benefit.
> >> >
> >> > OK?
> >> >
> >> > Thanks,
> >> > James
> >> >
> >> > ---
> >> > 2015-09-25  James Greenhalgh  <james.greenhalgh@arm.com>
> >> >
> >> >     * config/arm/aarch-common-protos.h
> >> >     (aarch_accumulator_forwarding): New.
> >> >     (aarch_forward_to_shift_is_not_shifted_reg): Likewise.
> >> >     * config/arm/aarch-common.c (aarch_accumulator_forwarding): New.
> >> >     (aarch_forward_to_shift_is_not_shifted_reg): Liekwise.
> >> >     * config/arm/cortex-a53.md: Rewrite.
> >> >
> >>
> >> OK aarch64 with Kyrill's comments fixed.
> >> /M
> >
> > Thanks,
> >
> > I had to rebase this over Evandro's recent patch adding neon_ldp/neon_ldp_q
> > types to the old scheduling model. The rebase was trivial to resolve, and
> > while I was there I also added the neon_stp/neon_stp_q types which were
> > missing.
> >
> > I've attached what I ultimately committed as revision 228324. I messed up
> > fixing the ChangeLog typo before commit, so that is revision 228325.
> >
> 
> Hi James,
> 
> Since this commit I can see
> gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c fail at -O2
> /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c:
> In function 'exec_vst1_lane':
> /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst1_lane.c:96:1:
> internal compiler error: output_operand: invalid %-code
> 0x78f79e output_operand_lossage(char const*, ...)
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3417
> 0x7934f3 output_asm_insn(char const*, rtx_def**)
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3782
> 0x793d77 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:3029
> 0x794b3a final(rtx_insn*, _IO_FILE*, int)
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:2058
> 0x7956fb rest_of_handle_final
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:4449
> 0x7956fb execute
>         /aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/final.c:4524
> Please submit a full bug report,
> 
> on aarch64_be.
> 
> I haven't looked at it in more details though.

Hi Christophe,

Thanks for the report, I'd be surprised if that was to do with the
scheduling model. I can reproduce the failure, and expect that Kugan's
patch at https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00129.html ought
to do the job of fixing the ICE.

Thanks,
James

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2015-09-25  8:20 [Patch 0/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
2015-09-25  8:30 ` [Patch 1/2 AArch64/ARM] Give AArch64 ROR (Immediate) a new type attribute James Greenhalgh
2015-09-25 13:19   ` Kyrill Tkachov
2015-09-25 13:49     ` James Greenhalgh
2015-09-25 15:05       ` Marcus Shawcroft
2015-09-25  8:37 ` [Patch 2/2 ARM/AArch64] Add a new Cortex-A53 scheduling model James Greenhalgh
2015-10-01  8:31   ` Kyrill Tkachov
2015-10-01  8:32   ` Marcus Shawcroft
2015-10-01  9:41     ` James Greenhalgh
2015-10-05 10:07       ` Christophe Lyon
2015-10-05 10:36         ` James Greenhalgh
