public inbox for gcc-patches@gcc.gnu.org
* [PATCH, ARM] Cortex-A8 backend fixes
@ 2012-02-09 15:09 Dmitry Melnik
From: Dmitry Melnik @ 2012-02-09 15:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 3074 bytes --]

This patch fixes a few issues in the pipeline description of the ARM
Cortex-A8.

1) arm_no_early_alu_shift_value_dep() checks the early dependence only
for one argument, ignoring a dependence on the register used as the
shift amount.  For example, this function is used as the condition in a
bypass that sets dep_cost=0 between mov and ALU operations:

   mov r0, r1
   add r3, r4, r5, asr r0

This results in dep_cost returning 0 for these insns, while according
to the Technical Reference Manual it should be 1
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babcagee.html).
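
To illustrate the problem, here is a toy Python model (not GCC code;
the register names and dict layout are ours) of the bypass condition
for the mov/add pair above.  The old check examines only the shifted
value, so it misses the mov's write to the shift-amount register r0:

```python
# Toy model of the bypass condition (not GCC code).  The consumer is
# "add r3, r4, r5, asr r0"; the producer "mov r0, r1" writes r0.
def no_early_dep_old(produced_reg, consumer):
    # Old behaviour: only the shifted value is checked.
    return produced_reg != consumer["shifted_value"]

def no_early_dep_new(produced_reg, consumer):
    # Fixed behaviour: the shift-amount register is checked as well.
    return produced_reg not in (consumer["shifted_value"],
                                consumer["shift_amount"])

consumer = {"shifted_value": "r5", "shift_amount": "r0"}
print(no_early_dep_old("r0", consumer))  # True: bypass wrongly taken, dep_cost=0
print(no_early_dep_new("r0", consumer))  # False: bypass rejected, dep_cost=1
```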


Also, PLUS and MINUS rtx expressions order their operands differently:
PLUS has the shift expression as its first operand, while MINUS usually
has the shift as its second operand.  But
arm_no_early_alu_shift_value_dep() checks only the first operand as
EARLY_OP.  We changed arm_no_early_alu_shift_dep() so that it uses
for_each_rtx() to find the SHIFT expression.  As all registers of the
SHIFT expression are required at stage E1, it makes no difference
whether the shift is the first or the second operand, so we use the new
arm_cortex_a8_no_early_alu_shift_dep() instead of
arm_no_early_alu_shift_value_dep() in the Cortex-A8 bypasses.  The
functions arm_no_early_alu_shift_[value_]dep() are also used in the
Cortex-A5, Cortex-R4 and ARM1136JFS descriptions, so we named the
modified function arm_cortex_a8_no_early_alu_shift_dep().
Besides SHIFTs and ROTATEs, the function also handles MULT (which is
used to represent shifts by a constant) as well as ZERO_EXTEND and
SIGN_EXTEND (they also have the alu_shift type).
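
As a rough illustration of why a recursive search helps, here is a toy
Python sketch (tuples stand in for rtx; not GCC code): the shift-like
subexpression is found regardless of which operand of PLUS or MINUS it
is.

```python
# Toy sketch of the recursive search (tuples stand in for rtx).
SHIFT_LIKE = {"ashift", "ashiftrt", "lshiftrt", "rotate", "rotatert",
              "mult", "zero_extend", "sign_extend"}

def find_early_op(x):
    """Return the first shift-like subexpression found in X, else None."""
    if isinstance(x, tuple):
        if x[0] in SHIFT_LIKE:
            return x
        for sub in x[1:]:
            hit = find_early_op(sub)
            if hit is not None:
                return hit
    return None

# add r3, r4, r5, lsl #2 -> shift (MULT by 4) is the FIRST operand of PLUS
plus  = ("plus", ("mult", "r5", 4), "r4")
# sub r3, r4, r5, lsl #2 -> shift is the SECOND operand of MINUS
minus = ("minus", "r4", ("mult", "r5", 4))
print(find_early_op(plus))   # found as first operand
print(find_early_op(minus))  # found as second operand
```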

2) The MUL to ALU bypass has an incorrect delay of 4 cycles, while
according to the TRM it has to be 5 for MUL and 6 for MULL.  The patch
splits this bypass in two and sets the correct delay values.
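
The corrected latencies can be summarized as a small table (a hedged
sketch; the reservation names follow cortex-a8.md, the dict itself is
ours):

```python
# Corrected multiply -> ALU bypass latencies per the Cortex-A8 TRM,
# after splitting the single 4-cycle bypass (toy table, not GCC code).
MUL_TO_ALU_LATENCY = {
    "cortex_a8_mul": 5,      # single-register result
    "cortex_a8_mla": 5,
    "cortex_a8_smulwy": 5,
    "cortex_a8_smlald": 5,
    "cortex_a8_mull": 6,     # 64-bit (MULL) result needs one more cycle
}
print(MUL_TO_ALU_LATENCY["cortex_a8_mull"])  # 6
```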

3) In cortex-a8.md, MOV-with-shift instructions were matched to the
wrong reservations (cortex_a8_alu_shift, cortex_a8_alu_shift_reg).
Adding the insn attribute "mov" to the arm_shiftsi3 pattern in arm.md
fixes that.

4) SMLALxy was moved from the cortex_a8_mull reservation to
cortex_a8_smlald, which according to the TRM has the proper timing for
this insn (1 cycle less than MULL).

5) The ARM Cortex-A8 TRM itself contains inaccurate timings for the
availability of RdLo in some multiply instructions.  Namely, the low
part of the result of the (S|U)MULL, (S|U)MLAL, UMAAL, SMLALxy, SMLALD
and SMLSLD instructions is already available at the E4 stage (instead
of E5 as in the TRM).

This information was initially found on the Beagle Board mailing list,
and it is confirmed by our tests and by these sites:
http://www.avison.me.uk/ben/programming/cortex-a8.html and
http://hilbert-space.de/?p=66

The patch adds two bypasses between these instructions and the MOV
instruction, using arm_mull_low_part_dep() to check whether the
dependency is only on the low part of the MULL destination.  Bypasses
between MULL and ALU insns for RdLo cannot be added, because bypasses
already exist between this pair of reservations.  However, in practice
these multiply insns are rare, and in SPEC2K INT code the low part of
the result of such insns is never used.
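
The low-part check can be illustrated with a toy model (Python,
register names ours; the real code temporarily narrows the destination
to SImode so that reg_overlap_mentioned_p sees only RdLo):

```python
# Toy model (not GCC code): a MULL writes RdLo and RdHi; the earlier
# bypass applies only if the consumer reads RdLo but not RdHi.
def mull_low_part_dep(rdlo, rdhi, consumer_reads):
    return rdlo in consumer_reads and rdhi not in consumer_reads

# smull r0, r1, r2, r3 ; mov r4, r0  -> low part only: bypass applies
print(mull_low_part_dep("r0", "r1", {"r0"}))  # True
# smull r0, r1, r2, r3 ; mov r4, r1  -> high part read: no bypass
print(mull_low_part_dep("r0", "r1", {"r1"}))  # False
```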

-- 
Best regards,
   Dmitry

[-- Attachment #2: cortex-a8-fixes.diff --]
[-- Type: text/x-diff, Size: 9167 bytes --]

2012-02-09  Ruben Buchatskiy <rb@ispras.ru>

        * config/arm/arm-protos.h (arm_cortex_a8_no_early_alu_shift_dep,
        arm_mull_low_part_dep): Declare.
        * config/arm/arm.c (arm_cortex_a8_no_early_alu_shift_dep,
        arm_mull_low_part_dep, is_early_op): New functions.
        * config/arm/arm.md (arm_shiftsi3): Add "mov" insn attribute.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..2a1334e 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -97,10 +97,12 @@ extern int neon_struct_mem_operand (rtx);
 extern int arm_no_early_store_addr_dep (rtx, rtx);
 extern int arm_early_store_addr_dep (rtx, rtx);
 extern int arm_early_load_addr_dep (rtx, rtx);
+extern int arm_cortex_a8_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_value_dep (rtx, rtx);
 extern int arm_no_early_mul_dep (rtx, rtx);
 extern int arm_mac_accumulator_is_mul_result (rtx, rtx);
+extern int arm_mull_low_part_dep (rtx, rtx);
 
 extern int tls_mentioned_p (rtx);
 extern int symbol_mentioned_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ee26c51..e92c75b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -23035,6 +23035,56 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
   return reg_overlap_mentioned_p (value, addr);
 }
 
+/* Return nonzero and copy *X to *DATA if *X is a shift-like expression.
+   Callback for for_each_rtx in arm_cortex_a8_no_early_alu_shift_dep().  */
+
+static int
+is_early_op (rtx *x, void *data)
+{
+  rtx *rtx_data = (rtx *) data;
+  enum rtx_code code;
+  code = GET_CODE (*x);
+
+  if (code == ASHIFT || code == ASHIFTRT || code == LSHIFTRT
+      || code == ROTATERT || code == ROTATE || code == MULT
+      || code == ZERO_EXTEND || code == SIGN_EXTEND)
+    {
+       *rtx_data = *x;
+       return 1;
+    }
+  else
+    return 0;
+}
+
+/* Return nonzero if the CONSUMER instruction (an ALU op) does not
+   have an early register shift value or amount dependency on the
+   result of PRODUCER.  */
+
+int
+arm_cortex_a8_no_early_alu_shift_dep (rtx producer, rtx consumer)
+{
+  rtx value = PATTERN (producer);
+  rtx op = PATTERN (consumer);
+  rtx early_op = NULL_RTX;
+
+  if (GET_CODE (value) == COND_EXEC)
+    value = COND_EXEC_CODE (value);
+  if (GET_CODE (value) == PARALLEL)
+    value = XVECEXP (value, 0, 0);
+  value = XEXP (value, 0);
+  if (GET_CODE (op) == COND_EXEC)
+    op = COND_EXEC_CODE (op);
+  if (GET_CODE (op) == PARALLEL)
+    op = XVECEXP (op, 0, 0);
+  op = XEXP (op, 1);
+
+  /* Traverse OP looking for a shift-like subexpression (shift, rotate,
+     MULT or extend); EARLY_OP will hold the whole matching rtx.  */
+  if (!for_each_rtx (&op, is_early_op, &early_op))
+    return 1;
+  return !reg_overlap_mentioned_p (value, early_op);
+}
+
 /* Return nonzero if the CONSUMER instruction (an ALU op) does not
    have an early register shift value or amount dependency on the
    result of PRODUCER.  */
@@ -23132,6 +23182,42 @@ arm_no_early_mul_dep (rtx producer, rtx consumer)
   return 0;
 }
 
+/* Return nonzero if the CONSUMER instruction has a dependency only on the
+   low part (RdLo) of the result of PRODUCER (a MULL insn), which for some
+   insns is available one cycle earlier than the high part.  */
+
+int
+arm_mull_low_part_dep (rtx producer, rtx consumer)
+{
+  rtx value = PATTERN (producer);
+  rtx op = PATTERN (consumer);
+  enum machine_mode mode;
+  int dep = 0;
+
+  if (GET_CODE (value) == COND_EXEC)
+    value = COND_EXEC_CODE (value);
+  if (GET_CODE (value) == PARALLEL)
+    value = XVECEXP (value, 0, 0);
+  value = XEXP (value, 0);
+  if (GET_CODE (op) == COND_EXEC)
+    op = COND_EXEC_CODE (op);
+  if (GET_CODE (op) == PARALLEL)
+    op = XVECEXP (op, 0, 0);
+  op = XEXP (op, 1);
+
+  /* Save the current MODE of VALUE. */
+  mode = GET_MODE (value);
+  if (mode != SImode && mode != VOIDmode)
+    PUT_MODE(value, SImode);
+  if (reg_overlap_mentioned_p (value, op))
+    dep = 1;
+
+  /* Restore the saved MODE. */
+  PUT_MODE(value, mode);
+
+  return dep;
+}
+
 /* We can't rely on the caller doing the proper promotion when
    using APCS or ATPCS.  */
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 7ac3f5c..aef0ff5 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -3666,6 +3666,7 @@
   "* return arm_output_shift(operands, 0);"
   [(set_attr "predicable" "yes")
    (set_attr "shift" "1")
+   (set_attr "insn" "mov")
    (set (attr "type") (if_then_else (match_operand 2 "const_int_operand" "")
 		      (const_string "alu_shift")
 		      (const_string "alu_shift_reg")))]
diff --git a/gcc/config/arm/cortex-a8.md b/gcc/config/arm/cortex-a8.md
index 1922e5c..ef0b41b 100644
--- a/gcc/config/arm/cortex-a8.md
+++ b/gcc/config/arm/cortex-a8.md
@@ -117,19 +117,19 @@
 ;; (Such a pair can be issued in parallel, hence latency zero.)
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu")
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; An ALU instruction followed by an ALU instruction with no early dep.
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu")
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; Multiplication instructions.  These are categorized according to their
 ;; reservation behavior and the need below to distinguish certain
@@ -149,7 +149,7 @@
 
 (define_insn_reservation "cortex_a8_mull" 7
   (and (eq_attr "tune" "cortexa8")
-       (eq_attr "insn" "smull,umull,smlal,umlal,umaal,smlalxy"))
+       (eq_attr "insn" "smull,umull,smlal,umlal,umaal"))
   "cortex_a8_multiply_3")
 
 (define_insn_reservation "cortex_a8_smulwy" 5
@@ -162,7 +162,7 @@
 ;; cannot go in cortex_a8_mla above.  (See below for bypass details.)
 (define_insn_reservation "cortex_a8_smlald" 6
   (and (eq_attr "tune" "cortexa8")
-       (eq_attr "insn" "smlald,smlsld"))
+       (eq_attr "insn" "smlald,smlsld,smlalxy"))
   "cortex_a8_multiply_2")
 
 ;; A multiply with a single-register result or an MLA, followed by an
@@ -174,17 +174,28 @@
 
 ;; A multiply followed by an ALU instruction needing the multiply
 ;; result only at E2 has lower latency than one needing it at E1.
-(define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
-                  cortex_a8_smulwy,cortex_a8_smlald"
+(define_bypass 5 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy,\
+                  cortex_a8_smlald"
+               "cortex_a8_alu")
+(define_bypass 6 "cortex_a8_mull"
                "cortex_a8_alu")
 (define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
                   cortex_a8_smulwy,cortex_a8_smlald"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
                   cortex_a8_smulwy,cortex_a8_smlald"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
+
+;; A MULL followed by a MOV instruction that needs only the low part
+;; (RdLo) of the result, which is available at the E4 stage.
+(define_bypass 6 "cortex_a8_mull"
+	       "cortex_a8_mov"
+	       "arm_mull_low_part_dep")
+(define_bypass 5 "cortex_a8_smlald"
+               "cortex_a8_mov"
+               "arm_mull_low_part_dep")
 
 ;; Load instructions.
 ;; The presence of any register writeback is ignored here.
@@ -201,10 +212,10 @@
                "cortex_a8_alu")
 (define_bypass 2 "cortex_a8_load1_2"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 2 "cortex_a8_load1_2"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; We do not currently model the fact that loads with scaled register
 ;; offsets that are not LSL #2 have an extra cycle latency (they issue
@@ -224,10 +235,10 @@
                "cortex_a8_alu")
 (define_bypass 4 "cortex_a8_load3_4"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 4 "cortex_a8_load3_4"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; Store instructions.
 ;; Writeback is again ignored.
