public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH, MIPS] add new peephole for 74k dspr2
@ 2012-08-16 15:13 Sandra Loosemore
  2012-08-19 17:23 ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-08-16 15:13 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 558 bytes --]

This patch adds a peephole optimization to use a clever trick to 
zero-initialize the two halves of an accumulator register with one 
instruction instead of a mtlo/mthi pair.  OK to check in?

-Sandra

2012-08-16  Sandra Loosemore  <sandra@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    MIPS Technologies, Inc.

	gcc/
	* config/mips/mips-dspr2.md (UNSPEC_ACC_INIT): Declare.
	(mult peephole2): Add peephole that converts
	"mtlo $ac[1-3],$0; mthi $ac[1-3],$0" into
         "mult $ac[1-3],$0,$0".
	(*mips_acc_init): New insn for above.



[-- Attachment #2: peep.patch --]
[-- Type: text/x-patch, Size: 1496 bytes --]

Index: gcc/config/mips/mips-dspr2.md
===================================================================
--- gcc/config/mips/mips-dspr2.md	(revision 190437)
+++ gcc/config/mips/mips-dspr2.md	(working copy)
@@ -68,6 +68,7 @@
   UNSPEC_DPAQX_SA_W_PH
   UNSPEC_DPSQX_S_W_PH
   UNSPEC_DPSQX_SA_W_PH
+  UNSPEC_ACC_INIT
 ])
 
 (define_insn "mips_absq_s_qb"
@@ -630,3 +631,33 @@
   [(set_attr "type"	"dspmacsat")
    (set_attr "accum_in" "1")
    (set_attr "mode"	"SI")])
+
+;; Convert  mtlo $ac[1-3],$0  =>  mult $ac[1-3],$0,$0
+;;          mthi $ac[1-3],$0
+(define_peephole2
+  [(set (match_operand:SI 0 "register_operand" "")
+	(const_int 0))
+   (set (match_operand:SI 1 "register_operand" "")
+	(const_int 0))]
+  "ISA_HAS_DSPR2
+   && !TARGET_MIPS16
+   && !TARGET_64BIT
+   && true_regnum (operands[0]) >= DSP_ACC_REG_FIRST
+   && true_regnum (operands[0]) <= DSP_ACC_REG_LAST
+   && true_regnum (operands[0]) / 2 == true_regnum (operands[1]) / 2"
+  [(parallel [(set (match_dup 0) (const_int 0))
+	      (set (match_dup 1) (const_int 0))
+	      (unspec [(const_int 0)] UNSPEC_ACC_INIT)])]
+)
+
+(define_insn "*mips_acc_init"
+  [(parallel
+    [(set (match_operand:SI 0 "register_operand" "=a") (const_int 0))
+     (set (match_operand:SI 1 "register_operand" "=a") (const_int 0))
+     (unspec [(const_int 0)] UNSPEC_ACC_INIT)])]
+  "ISA_HAS_DSPR2
+   && !TARGET_MIPS16
+   && !TARGET_64BIT"
+  "mult\t%q0,$0,$0\t\t# Clear ACC HI/LO"
+  [(set_attr "type"	"imul")
+   (set_attr "mode"	"SI")])

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-08-16 15:13 [PATCH, MIPS] add new peephole for 74k dspr2 Sandra Loosemore
@ 2012-08-19 17:23 ` Richard Sandiford
  2012-08-22 14:26   ` Sandra Loosemore
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-08-19 17:23 UTC (permalink / raw)
  To: Sandra Loosemore; +Cc: gcc-patches

Sandra Loosemore <sandra@codesourcery.com> writes:
> This patch adds a peephole optimization to use a clever trick to 
> zero-initialize the two halves of an accumulator register with one 
> instruction instead of a mtlo/mthi pair.  OK to check in?
>
> -Sandra
>
> 2012-08-16  Sandra Loosemore  <sandra@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 	    MIPS Technologies, Inc.
>
> 	gcc/
> 	* config/mips/mips-dspr2.md (UNSPEC_ACC_INIT): Declare.
> 	(mult peephole2): Add peephole that converts
> 	"mtlo $ac[1-3],$0; mthi $ac[1-3],$0" into
>          "mult $ac[1-3],$0,$0".
> 	(*mips_acc_init): New insn for above.

Not sure whether a peephole is the right choice here.  In practice,
I'd imagine these opportunities would only come from a DImode move of
$0 into a doubleword register, so we could simply emit the pattern in
mips_split_doubleword_move.

That would also allow us to use it for plain HI and LO.  It wasn't
obvious from the patch why it was restricted to the DSP extension
registers.

Please also add a scan-assembler test.

Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-08-19 17:23 ` Richard Sandiford
@ 2012-08-22 14:26   ` Sandra Loosemore
  2012-08-27 16:37     ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-08-22 14:26 UTC (permalink / raw)
  To: gcc-patches, rdsandiford

[-- Attachment #1: Type: text/plain, Size: 756 bytes --]

On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>
> Not sure whether a peephole is the right choice here.  In practice,
> I'd imagine these opportunities would only come from a DImode move of
> $0 into a doubleword register, so we could simply emit the pattern in
> mips_split_doubleword_move.
>
> That would also allow us to use it for plain HI and LO.  It wasn't
> obvious from the patch why it was restricted to the DSP extension
> registers.
>
> Please also add a scan-assembler test.

How is this version of the fix?

-Sandra


2012-08-22  Sandra Loosemore  <sandra@codesourcery.com>

	gcc/
	* mips.c (mips_split_doubleword_move): Use mult instruction to
	zero-initialize accumulator.

	gcc/testsuite/
	* gcc.target/mips/mips32-dsp-accinit.c: New.



[-- Attachment #2: accinit.patch --]
[-- Type: text/x-patch, Size: 1524 bytes --]

Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	(revision 190463)
+++ gcc/config/mips/mips.c	(working copy)
@@ -4158,6 +4158,14 @@ mips_split_doubleword_move (rtx dest, rt
       else
 	emit_insn (gen_mfhisi_di (mips_subword (dest, true), src));
     }
+  else if (!TARGET_64BIT && !TARGET_MIPS16 && ISA_HAS_DSP_MULT
+	   && src == const0_rtx
+	   && REG_P (dest) && ACC_REG_P (REGNO (dest)))
+    /* Zero-initialize accumulator using "mult $dest,$0,$0" instead
+       of a mthi/mtlo pair.  */
+    emit_insn (gen_mulsidi3_32bit (dest,
+				   gen_rtx_REG (SImode, GP_REG_FIRST),
+				   gen_rtx_REG (SImode, GP_REG_FIRST)));
   else
     {
       /* The operation can be split into two normal moves.  Decide in
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c
===================================================================
--- gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c	(revision 0)
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c	(revision 0)
@@ -0,0 +1,16 @@
+/* { dg-options "-O2 -march=74kc -mgp32" } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler "mult\t\\\$ac.,\\\$0,\\\$0" } } */

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-08-22 14:26   ` Sandra Loosemore
@ 2012-08-27 16:37     ` Richard Sandiford
  2012-09-18 17:18       ` PING " Sandra Loosemore
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-08-27 16:37 UTC (permalink / raw)
  To: Sandra Loosemore; +Cc: gcc-patches

Sandra Loosemore <sandra@codesourcery.com> writes:
> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>
>> Not sure whether a peephole is the right choice here.  In practice,
>> I'd imagine these opportunities would only come from a DImode move of
>> $0 into a doubleword register, so we could simply emit the pattern in
>> mips_split_doubleword_move.
>>
>> That would also allow us to use it for plain HI and LO.  It wasn't
>> obvious from the patch why it was restricted to the DSP extension
>> registers.
>>
>> Please also add a scan-assembler test.
>
> How is this version of the fix?

Just to say that I've not forgotten about this.  I'd still like to
remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
idea isn't specific to either.

Also, reviewing the patch made me realise that it might be better
to keep the move intact and simply use "mult" in the output code.
That's my fault for suggesting the wrong thing, though, so I was
hoping to find time this weekend to try it myself.  The testsuite
stuff ended up taking up all the available time instead.

Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-08-27 16:37     ` Richard Sandiford
@ 2012-09-18 17:18       ` Sandra Loosemore
  2012-09-18 18:17         ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-09-18 17:18 UTC (permalink / raw)
  To: gcc-patches, rdsandiford

On 08/27/2012 10:36 AM, Richard Sandiford wrote:
> Sandra Loosemore<sandra@codesourcery.com>  writes:
>> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>>
>>> Not sure whether a peephole is the right choice here.  In practice,
>>> I'd imagine these opportunities would only come from a DImode move of
>>> $0 into a doubleword register, so we could simply emit the pattern in
>>> mips_split_doubleword_move.
>>>
>>> That would also allow us to use it for plain HI and LO.  It wasn't
>>> obvious from the patch why it was restricted to the DSP extension
>>> registers.
>>>
>>> Please also add a scan-assembler test.
>>
>> How is this version of the fix?
>
> Just to say that I've not forgotten about this.  I'd still like to
> remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
> idea isn't specific to either.
>
> Also, reviewing the patch made me realise that it might be better
> to keep the move intact and simply use "mult" in the output code.
> That's my fault for suggesting the wrong thing, though, so I was
> hoping to find time this weekend to try it myself.  The testsuite
> stuff ended up taking up all the available time instead.

Richard,

Have you had time to think about this some more?  I am not sure I can 
guess how you'd like me to fix this patch now without some more specific 
review and/or suggestions about where the optimization should happen and 
what cases it should be extended to detect in addition to the dsp 
accumulator multiplies.

-Sandra

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-18 17:18       ` PING " Sandra Loosemore
@ 2012-09-18 18:17         ` Richard Sandiford
  2012-09-24 15:49           ` Maciej W. Rozycki
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-18 18:17 UTC (permalink / raw)
  To: Sandra Loosemore; +Cc: gcc-patches

Sandra Loosemore <sandra@codesourcery.com> writes:
> On 08/27/2012 10:36 AM, Richard Sandiford wrote:
>> Sandra Loosemore<sandra@codesourcery.com>  writes:
>>> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>>>
>>>> Not sure whether a peephole is the right choice here.  In practice,
>>>> I'd imagine these opportunities would only come from a DImode move of
>>>> $0 into a doubleword register, so we could simply emit the pattern in
>>>> mips_split_doubleword_move.
>>>>
>>>> That would also allow us to use it for plain HI and LO.  It wasn't
>>>> obvious from the patch why it was restricted to the DSP extension
>>>> registers.
>>>>
>>>> Please also add a scan-assembler test.
>>>
>>> How is this version of the fix?
>>
>> Just to say that I've not forgotten about this.  I'd still like to
>> remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
>> idea isn't specific to either.
>>
>> Also, reviewing the patch made me realise that it might be better
>> to keep the move intact and simply use "mult" in the output code.
>> That's my fault for suggesting the wrong thing, though, so I was
>> hoping to find time this weekend to try it myself.  The testsuite
>> stuff ended up taking up all the available time instead.
>
> Richard,
>
> Have you had time to think about this some more?  I am not sure I can 
> guess how you'd like me to fix this patch now without some more specific 
> review and/or suggestions about where the optimization should happen and 
> what cases it should be extended to detect in addition to the dsp 
> accumulator multiplies.

The patch below is the one I've been testing.  But I got sidetracked
by looking into the possibility of removing the MD0_REG and MD1_REG
classes, in order to get more sensible costs.  I think that was needed
for the madd-9.c test to pass.

Richard


Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h	2012-09-03 07:49:57.319188985 +0100
+++ gcc/config/mips/mips-protos.h	2012-09-04 20:15:10.240130458 +0100
@@ -212,8 +212,8 @@ extern int m16_simm8_8 (rtx, enum machin
 extern int m16_nsimm8_8 (rtx, enum machine_mode);
 
 extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx);
+extern void mips_split_move (rtx, rtx);
 extern const char *mips_output_move (rtx, rtx);
 extern bool mips_cfun_has_cprestore_slot_p (void);
 extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	2012-09-04 20:15:08.191130518 +0100
+++ gcc/config/mips/mips.c	2012-09-04 20:15:17.173130256 +0100
@@ -2395,11 +2395,11 @@ mips_load_store_insns (rtx mem, rtx insn
   mode = GET_MODE (mem);
 
   /* Try to prove that INSN does not need to be split.  */
-  might_split_p = true;
-  if (GET_MODE_BITSIZE (mode) == 64)
+  might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+  if (might_split_p)
     {
       set = single_set (insn);
-      if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+      if (set && !mips_split_move_p (SET_DEST (set), SET_SRC (set)))
 	might_split_p = false;
     }
 
@@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
   return simplify_gen_subreg (word_mode, op, mode, byte);
 }
 
-/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
+/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src)
+{
+  return (src == const0_rtx
+	  && REG_P (dest)
+	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+	  && (ISA_HAS_DSP_MULT
+	      ? ACC_REG_P (REGNO (dest))
+	      : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two.  */
 
 bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src)
 {
-  if (TARGET_64BIT)
+  /* Check whether the move can be done using some variant of MULT $0,$0.  */
+  if (mips_mult_move_p (dest, src))
     return false;
 
   /* FPR-to-FPR moves can be done in a single instruction, if they're
      allowed at all.  */
-  if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+  unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+  if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
     return false;
 
   /* Check for floating-point loads and stores.  */
-  if (ISA_HAS_LDC1_SDC1)
+  if (size == 8 && ISA_HAS_LDC1_SDC1)
     {
       if (FP_REG_RTX_P (dest) && MEM_P (src))
 	return false;
       if (FP_REG_RTX_P (src) && MEM_P (dest))
 	return false;
     }
-  return true;
+
+  /* Otherwise split all multiword moves.  */
+  return size > UNITS_PER_WORD;
 }
 
-/* Split a doubleword move from SRC to DEST.  On 32-bit targets,
-   this function handles 64-bit moves for which mips_split_64bit_move_p
-   holds.  For 64-bit targets, this function handles 128-bit moves.  */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds.  */
 
 void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src)
 {
   rtx low_dest;
 
+  gcc_assert (mips_split_move_p (dest, src));
   if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
     {
       if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4209,7 +4225,7 @@ mips_output_move (rtx dest, rtx src)
   mode = GET_MODE (dest);
   dbl_p = (GET_MODE_SIZE (mode) == 8);
 
-  if (dbl_p && mips_split_64bit_move_p (dest, src))
+  if (mips_split_move_p (dest, src))
     return "#";
 
   if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4220,6 +4236,14 @@ mips_output_move (rtx dest, rtx src)
 	  if (GP_REG_P (REGNO (dest)))
 	    return "move\t%0,%z1";
 
+	  if (mips_mult_move_p (dest, src))
+	    {
+	      if (ISA_HAS_DSP_MULT)
+		return "mult\t%q0,%.,%.";
+	      else
+		return "mult\t%.,%.";
+	    }
+
 	  /* Moves to HI are handled by special .md insns.  */
 	  if (REGNO (dest) == LO_REGNUM)
 	    return "mtlo\t%z1";
@@ -10430,8 +10454,8 @@ mips_save_reg (rtx reg, rtx mem)
     {
       rtx x1, x2;
 
-      if (mips_split_64bit_move_p (mem, reg))
-	mips_split_doubleword_move (mem, reg);
+      if (mips_split_move_p (mem, reg))
+	mips_split_move (mem, reg);
       else
 	mips_emit_move (mem, reg);
 
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md	2012-09-03 07:49:57.318188985 +0100
+++ gcc/config/mips/mips.md	2012-09-04 20:15:10.293130456 +0100
@@ -204,7 +204,7 @@ (define_attr "jal_macro" "no,yes"
 ;; the split instructions; in some cases, it is more appropriate for the
 ;; scheduling type to be "multi" instead.
 (define_attr "move_type"
-  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
    const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
    shift_shift"
   (const_string "unknown"))
@@ -369,6 +369,7 @@ (define_attr "type"
 	 (eq_attr "move_type" "mflo") (const_string "mflo")
 
 	 ;; These types of move are always single insns.
+	 (eq_attr "move_type" "imul") (const_string "imul")
 	 (eq_attr "move_type" "fmove") (const_string "fmove")
 	 (eq_attr "move_type" "loadpool") (const_string "load")
 	 (eq_attr "move_type" "signext") (const_string "signext")
@@ -4254,13 +4255,13 @@ (define_insn "*mov<mode>_ra"
    (set_attr "mode" "<MODE>")])
 
 (define_insn "*movdi_32bit"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
-	(match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+	(match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
   "!TARGET_64BIT && !TARGET_MIPS16
    && (register_operand (operands[0], DImode)
        || reg_or_0_operand (operands[1], DImode))"
   { return mips_output_move (operands[0], operands[1]); }
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
    (set_attr "mode" "DI")])
 
 (define_insn "*movdi_32bit_mips16"
@@ -4707,14 +4708,14 @@ (define_expand "movti"
 })
 
 (define_insn "*movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
-	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
   "TARGET_64BIT
    && !TARGET_MIPS16
    && (register_operand (operands[0], TImode)
        || reg_or_0_operand (operands[1], TImode))"
-  "#"
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
+  { return mips_output_move (operands[0], operands[1]); }
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
    (set_attr "mode" "TI")])
 
 (define_insn "*movti_mips16"
@@ -4765,21 +4766,20 @@ (define_insn "*movtf_mips16"
 (define_split
   [(set (match_operand:MOVE64 0 "nonimmediate_operand")
 	(match_operand:MOVE64 1 "move_operand"))]
-  "reload_completed && !TARGET_64BIT
-   && mips_split_64bit_move_p (operands[0], operands[1])"
+  "reload_completed && mips_split_move_p (operands[0], operands[1])"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move (operands[0], operands[1]);
   DONE;
 })
 
 (define_split
   [(set (match_operand:MOVE128 0 "nonimmediate_operand")
 	(match_operand:MOVE128 1 "move_operand"))]
-  "TARGET_64BIT && reload_completed"
+  "reload_completed && mips_split_move_p (operands[0], operands[1])"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move (operands[0], operands[1]);
   DONE;
 })
 
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c	2012-09-03 07:49:57.318188985 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c	2012-09-04 21:21:17.132015119 +0100
@@ -1,7 +1,13 @@
 /* { dg-do compile } */
 /* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* References to X within the loop need to have a higher frequency than
+   references to X outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 /* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
 /* { dg-final { scan-assembler "\tmadd\t" } } */
 
 NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c
===================================================================
--- /dev/null	2012-08-19 20:42:12.842999468 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c	2012-09-04 21:04:52.697043742 +0100
@@ -0,0 +1,20 @@
+/* { dg-options "-mdspr2 -mgp32" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-18 18:17         ` Richard Sandiford
@ 2012-09-24 15:49           ` Maciej W. Rozycki
  2012-09-24 21:40             ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-24 15:49 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches

On Tue, 18 Sep 2012, Richard Sandiford wrote:

> > Have you had time to think about this some more?  I am not sure I can 
> > guess how you'd like me to fix this patch now without some more specific 
> > review and/or suggestions about where the optimization should happen and 
> > what cases it should be extended to detect in addition to the dsp 
> > accumulator multiplies.
> 
> The patch below is the one I've been testing.  But I got sidetracked
> by looking into the possibility of removing the MD0_REG and MD1_REG
> classes, in order to get more sensible costs.  I think that was needed
> for the madd-9.c test to pass.

 Sorry to come up with this so late -- I have only now noticed this being 
discussed.

> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>    return simplify_gen_subreg (word_mode, op, mode, byte);
>  }
>  
> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
> +
> +static bool
> +mips_mult_move_p (rtx dest, rtx src)
> +{
> +  return (src == const0_rtx
> +	  && REG_P (dest)
> +	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
> +	  && (ISA_HAS_DSP_MULT
> +	      ? ACC_REG_P (REGNO (dest))
> +	      : MD_REG_P (REGNO (dest))));
> +}
> +
> +/* Return true if a move from SRC to DEST should be split into two.  */

 Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower 
than MTHI $0/MTLO $0?  The latency of multiplication varies among 
implementations, for example the original R3000 took 12 cycles (of course 
the R3000 itself is not relevant for this change, but you see the 
picture!).  On the other hand in some (but not all!) processors 
multiplication runs in parallel to the main pipeline so it is the 
difference, if positive, between the number of cycles consumed by other 
instructions up to the next HI/LO access instruction and the latency of 
MULT run in the background that matters.

 From the context I am assuming none of this matters for the 74K (and 
presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
isn't it something that should be decided based on instruction costs from 
DFA schedulers?  Is there anything that I've missed here?  It doesn't 
appear to me your (and neither the original) proposal takes instruction 
cost calculation into consideration.

  Maciej

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-24 15:49           ` Maciej W. Rozycki
@ 2012-09-24 21:40             ` Richard Sandiford
  2012-09-25  0:52               ` Maciej W. Rozycki
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-24 21:40 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches

"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more?  I am not sure I can 
>> > guess how you'd like me to fix this patch now without some more specific 
>> > review and/or suggestions about where the optimization should happen and 
>> > what cases it should be extended to detect in addition to the dsp 
>> > accumulator multiplies.
>> 
>> The patch below is the one I've been testing.  But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs.  I think that was needed
>> for the madd-9.c test to pass.
>
>  Sorry to come up with this so late -- I have only now noticed this being 
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>>    return simplify_gen_subreg (word_mode, op, mode, byte);
>>  }
>>  
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> +  return (src == const0_rtx
>> +	  && REG_P (dest)
>> +	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> +	  && (ISA_HAS_DSP_MULT
>> +	      ? ACC_REG_P (REGNO (dest))
>> +	      : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two.  */
>
>  Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower 
> than MTHI $0/MTLO $0?  The latency of multiplication varies among 
> implementations, for example the original R3000 took 12 cycles (of course 
> the R3000 itself is not relevant for this change, but you see the 
> picture!).  On the other hand in some (but not all!) processors 
> multiplication runs in parallel to the main pipeline so it is the 
> difference, if positive, between the number of cycles consumed by other 
> instructions up to the next HI/LO access instruction and the latency of 
> MULT run in the background that matters.
>
>  From the context I am assuming none of this matters for the 74K (and 
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> isn't it something that should be decided based on instruction costs from 
> DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> appear to me your (and neither the original) proposal takes instruction 
> cost calculation into consideration.

In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations.  We deliberately don't use HI and LO as scratch space.

I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to MSUB.
I certainly don't know of any counter-examples.  The decision is deliberately
centeralised in one place so that the condition can be tweaked in future
if necessary.

Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-24 21:40             ` Richard Sandiford
@ 2012-09-25  0:52               ` Maciej W. Rozycki
  2012-09-25  8:38                 ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-25  0:52 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches

On Mon, 24 Sep 2012, Richard Sandiford wrote:

> >  From the context I am assuming none of this matters for the 74K (and 
> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> > isn't it something that should be decided based on instruction costs from 
> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> > appear to me your (and neither the original) proposal takes instruction 
> > cost calculation into consideration.
> 
> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
> operations.  We deliberately don't use HI and LO as scratch space.
> 
> I think it's a reasonable default assumption that anything that supports
> those instructions also has a fast path from MULT to MADD or MULT to MSUB.

 According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
obviously the resulting larger code may adversely affect cache performance 
in some scenarios.

> I certainly don't know of any counter-examples.  The decision is deliberately
> centeralised in one place so that the condition can be tweaked in future
> if necessary.

 Sure, the cost comparison could be done in that single place as well.

  Maciej

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-25  0:52               ` Maciej W. Rozycki
@ 2012-09-25  8:38                 ` Richard Sandiford
  2012-09-25 10:29                   ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-25  8:38 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches

"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>
>> >  From the context I am assuming none of this matters for the 74K (and 
>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
>> > isn't it something that should be decided based on instruction costs from 
>> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
>> > appear to me your (and neither the original) proposal takes instruction 
>> > cost calculation into consideration.
>> 
>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>> operations.  We deliberately don't use HI and LO as scratch space.
>> 
>> I think it's a reasonable default assumption that anything that supports
>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>
>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
> obviously the resulting larger code may adversely affect cache performance 
> in some scenarios.

That's not how the 4650 DFA models it though.

(define_insn_reservation "generic_hilo" 1
  (eq_attr "type" "mfhi,mflo,mthi,mtlo")
  "imuldiv*3")

(define_insn_reservation "r4650_imul" 4
  (and (eq_attr "cpu" "r4650")
       (eq_attr "type" "imul,imul3,imadd"))
  "imuldiv*4")

So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-25  8:38                 ` Richard Sandiford
@ 2012-09-25 10:29                   ` Richard Sandiford
  2012-09-25 18:06                     ` Maciej W. Rozycki
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-25 10:29 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches

Richard Sandiford <rdsandiford@googlemail.com> writes:
> "Maciej W. Rozycki" <macro@codesourcery.com> writes:
>> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>>
>>> >  From the context I am assuming none of this matters for the 74K (and 
>>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
>>> > isn't it something that should be decided based on instruction costs from 
>>> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
>>> > appear to me your (and neither the original) proposal takes instruction 
>>> > cost calculation into consideration.
>>> 
>>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>>> operations.  We deliberately don't use HI and LO as scratch space.
>>> 
>>> I think it's a reasonable default assumption that anything that supports
>>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>>
>>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
>> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
>> obviously the resulting larger code may adversely affect cache performance 
>> in some scenarios.
>
> That's not how the 4650 DFA models it though.
>
> (define_insn_reservation "generic_hilo" 1
>   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
>   "imuldiv*3")
>
> (define_insn_reservation "r4650_imul" 4
>   (and (eq_attr "cpu" "r4650")
>        (eq_attr "type" "imul,imul3,imadd"))
>   "imuldiv*4")
>
> So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
required for this optimisation is getting a bit ridiculous.

Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-25 10:29                   ` Richard Sandiford
@ 2012-09-25 18:06                     ` Maciej W. Rozycki
  2012-10-07  8:45                       ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-25 18:06 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches

On Tue, 25 Sep 2012, Richard Sandiford wrote:

> >>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
> >> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
> >> obviously the resulting larger code may adversely affect cache performance 
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> >   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> >   "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> >   (and (eq_attr "cpu" "r4650")
> >        (eq_attr "type" "imul,imul3,imadd"))
> >   "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

 I can't track a reference on R4650 MTHI/MTLO latency; I'd be happy to 
learn of one, or otherwise I wonder where the delay is coming from.  Also 
a small update: apparently MULT is 3 clocks only on the R4650 where 
operands are 16 bits (unsure if it is enough if only one is; for a zero by 
zero multiplication it surely does not matter though).  So I think using a 
MULT here is at least reasonable.

> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
> required for this optimisation is getting a bit ridiculous.

 I have double-checked some documentation, and in fact many MIPS cores, 
including the current ones, have a configuration option to include either 
a high-performance or an area-efficient MD unit.  Take the M14Kc for 
example -- its high-performance unit has a one-cycle latency/issue rate 
for 16-bit multiplication (two-cycle for full 32 bits; here the width of 
rt is explicitly named) and the area-efficient has a 32-cycle 
latency/issue rate only regardless of the operand size (obviously 
iterating over addition one bit at a time).  The latency of MTHI/MTLO is 1 
across both units.

 So I think this can't really be selected automatically for all cores, 
some human-supplied knowledge about the MD unit used is required -- that 
obviously affects other operations too, e.g. some multiplications 
involving a constant that may be cheaper to do either directly or with a 
sequence of additions depending on the MD unit present (unless optimising 
for size, of course).

  Maciej

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-09-25 18:06                     ` Maciej W. Rozycki
@ 2012-10-07  8:45                       ` Richard Sandiford
  2012-10-08 23:23                         ` Maciej W. Rozycki
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-10-07  8:45 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches

"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 25 Sep 2012, Richard Sandiford wrote:
>> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
>> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
>> required for this optimisation is getting a bit ridiculous.
>
>  I have double-checked some documentation, and in fact many MIPS cores, 
> including the current ones, have a configuration option to include either 
> a high-performance or an area-efficient MD unit.  Take the M14Kc for 
> example -- its high-performance unit has a one-cycle latency/issue rate 
> for 16-bit multiplication (two-cycle for full 32 bits; here the width of 
> rt is explicitly named) and the area-efficient has a 32-cycle 
> latency/issue rate only regardless of the operand size (obviously 
> iterating over addition one bit at a time).  The latency of MTHI/MTLO is 1 
> across both units.
>
>  So I think this can't really be selected automatically for all cores, 
> some human-supplied knowledge about the MD unit used is required -- that 
> obviously affects other operations too, e.g. some multiplications 
> involving a constant that may be cheaper to do either directly or with a 
> sequence of additions depending on the MD unit present (unless optimising 
> for size, of course).

Yeah, as far as this optimisation goes, I think your original suggestion
of using the DFA to work out the best sequence is still right.  If the DFA
says that multiplications take a handful of cycles when they actually take
32 then we're not going to optimise multiplication very well whatever happens.

In practice, code destined for non-4kc cores with iterative multipliers
would probably tune well with -mtune=4kp.

Anyway, I said upthread that I was looking into removing MD0_REG and
MD1_REG because I thought I'd seen a case where that was needed for
madd-9.c to pass.  I can no longer reproduce that, so it was probably
pilot error.

The final patch ended up being much more complicated than I'd have liked,
but at least it should be easier to add other automatically-derived tuning
costs in future.

Tested on mipsisa64-elf and mipsisa32-elf.  Applied.

Richard


gcc/
	* config/mips/mips-protos.h (mips_split_type): New enum.
	(mips_split_64bit_move_p, mips_split_doubleword_move): Delete.
	(mips_split_move_p, mips_split_move, mips_split_move_insn_p)
	(mips_split_move_insn): Declare.
	* config/mips/mips.c (mips_tuning_info): New variable.
	(mips_load_store_insns): Use mips_split_move_insn_p instead of
	mips_split_64bit_move_p.
	(mips_emit_move_or_split, mips_mult_move_p): New functions.
	(mips_split_64bit_move_p): Rename to...
	(mips_split_move_p): ...this and take a mips_split_type argument.
	Generalize to all moves.  Call mips_mult_move_p.
	(mips_split_doubleword_move): Rename to...
	(mips_split_move): ...this and take a mips_split_type argument.
	Assert that mips_split_move_p holds.
	(mips_insn_split_type, mips_split_move_insn_p, mips_split_move_insn):
	New functions.
	(mips_output_move): Use mips_split_move_p instead of
	mips_split_64bit_move_p.  Handle MULT $0, $0 moves.
	(mips_save_reg): Use mips_emit_move_or_split.
	(mips_sim_reset): Assign to curr_state.  Call targetm.sched.init
	and advance_state.
	(mips_sim_init): Call targetm.sched.init_dfa_pre_cycle_insn and
	targetm.sched.init_dfa_post_cycle_insn, if defined.
	(mips_sim_next_cycle): Assign to curr_state.  Use advance_state
	instead of state_transition.
	(mips_sim_issue_insn): Assign to curr_state.  Use
	targetm.sched.variable_issue to see how many more insns
	can be issued.
	(mips_seq_time, mips_mult_zero_zero_cost)
	(mips_set_fast_mult_zero_zero_p, mips_set_tuning_info)
	(mips_expand_to_rtl_hook): New functions.
	(TARGET_EXPAND_TO_RTL_HOOK): Define.
	* config/mips/mips.md (move_type): Add imul.
	(type): Map imul move_types to imul.
	(*movdi_32bit, *movti): Add imul alternatives.
	Use mips_split_move_insn_p and mips_split_move_insn instead of
	mips_split_64bit_move_p and mips_split_doubleword_move in move
	splitters.

gcc/testsuite/
2012-10-07  Richard Sandiford  <rdsandiford@googlemail.com>
	    Sandra Loosemore  <sandra@codesourcery.com>

	* gcc.target/mips/madd-9.c: Force code to be tuned for the 4kc
	and test that the accumulator is initialized using MULT.
	* gcc.target/mips/mips32-dsp-accinit-1.c: New test.
	* gcc.target/mips/mips32-dsp-accinit-2.c: Likewise.

Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h	2012-10-06 17:26:43.408618293 +0100
+++ gcc/config/mips/mips-protos.h	2012-10-07 08:21:40.892791332 +0100
@@ -173,6 +173,25 @@ enum mips_call_type {
   MIPS_CALL_EPILOGUE
 };
 
+/* Controls the conditions under which certain instructions are split.
+
+   SPLIT_IF_NECESSARY
+	Only perform splits that are necessary for correctness
+	(because no unsplit version exists).
+
+   SPLIT_FOR_SPEED
+	Perform splits that are necessary for correctness or
+	beneficial for code speed.
+
+   SPLIT_FOR_SIZE
+	Perform splits that are necessary for correctness or
+	beneficial for code size.  */
+enum mips_split_type {
+  SPLIT_IF_NECESSARY,
+  SPLIT_FOR_SPEED,
+  SPLIT_FOR_SIZE
+};
+
 extern bool mips_symbolic_constant_p (rtx, enum mips_symbol_context,
 				      enum mips_symbol_type *);
 extern int mips_regno_mode_ok_for_base_p (int, enum machine_mode, bool);
@@ -212,8 +231,10 @@ extern int m16_simm8_8 (rtx, enum machin
 extern int m16_nsimm8_8 (rtx, enum machine_mode);
 
 extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx, enum mips_split_type);
+extern void mips_split_move (rtx, rtx, enum mips_split_type);
+extern bool mips_split_move_insn_p (rtx, rtx, rtx);
+extern void mips_split_move_insn (rtx, rtx, rtx);
 extern const char *mips_output_move (rtx, rtx);
 extern bool mips_cfun_has_cprestore_slot_p (void);
 extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	2012-10-06 22:24:11.262714074 +0100
+++ gcc/config/mips/mips.c	2012-10-07 09:38:39.656797917 +0100
@@ -265,6 +265,24 @@ static const char *const mips_fp_conditi
   MIPS_FP_CONDITIONS (STRINGIFY)
 };
 
+/* Tuning information that is automatically derived from other sources
+   (such as the scheduler).  */
+static struct {
+  /* The architecture and tuning settings that this structure describes.  */
+  enum processor arch;
+  enum processor tune;
+
+  /* True if this structure describes MIPS16 settings.  */
+  bool mips16_p;
+
+  /* True if the structure has been initialized.  */
+  bool initialized_p;
+
+  /* True if "MULT $0, $0" is preferable to "MTLO $0; MTHI $0"
+     when optimizing for speed.  */
+  bool fast_mult_zero_zero_p;
+} mips_tuning_info;
+
 /* Information about a function's frame layout.  */
 struct GTY(())  mips_frame_info {
   /* The size of the frame in bytes.  */
@@ -2395,11 +2413,11 @@ mips_load_store_insns (rtx mem, rtx insn
   mode = GET_MODE (mem);
 
   /* Try to prove that INSN does not need to be split.  */
-  might_split_p = true;
-  if (GET_MODE_BITSIZE (mode) == 64)
+  might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+  if (might_split_p)
     {
       set = single_set (insn);
-      if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+      if (set && !mips_split_move_insn_p (SET_DEST (set), SET_SRC (set), insn))
 	might_split_p = false;
     }
 
@@ -2441,6 +2459,18 @@ mips_emit_move (rtx dest, rtx src)
 	  : emit_move_insn_1 (dest, src));
 }
 
+/* Emit a move from SRC to DEST, splitting compound moves into individual
+   instructions.  SPLIT_TYPE is the type of split to perform.  */
+
+static void
+mips_emit_move_or_split (rtx dest, rtx src, enum mips_split_type split_type)
+{
+  if (mips_split_move_p (dest, src, split_type))
+    mips_split_move (dest, src, split_type);
+  else
+    mips_emit_move (dest, src);
+}
+
 /* Emit an instruction of the form (set TARGET (CODE OP0)).  */
 
 static void
@@ -4119,39 +4149,60 @@ mips_subword (rtx op, bool high_p)
   return simplify_gen_subreg (word_mode, op, mode, byte);
 }
 
-/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
+/* Return true if SRC should be moved into DEST using "MULT $0, $0".
+   SPLIT_TYPE is the condition under which moves should be split.  */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src, enum mips_split_type split_type)
+{
+  return ((split_type != SPLIT_FOR_SPEED
+	   || mips_tuning_info.fast_mult_zero_zero_p)
+	  && src == const0_rtx
+	  && REG_P (dest)
+	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+	  && (ISA_HAS_DSP_MULT
+	      ? ACC_REG_P (REGNO (dest))
+	      : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two.
+   SPLIT_TYPE describes the split condition.  */
 
 bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src, enum mips_split_type split_type)
 {
-  if (TARGET_64BIT)
+  /* Check whether the move can be done using some variant of MULT $0,$0.  */
+  if (mips_mult_move_p (dest, src, split_type))
     return false;
 
   /* FPR-to-FPR moves can be done in a single instruction, if they're
      allowed at all.  */
-  if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+  unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+  if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
     return false;
 
   /* Check for floating-point loads and stores.  */
-  if (ISA_HAS_LDC1_SDC1)
+  if (size == 8 && ISA_HAS_LDC1_SDC1)
     {
       if (FP_REG_RTX_P (dest) && MEM_P (src))
 	return false;
       if (FP_REG_RTX_P (src) && MEM_P (dest))
 	return false;
     }
-  return true;
+
+  /* Otherwise split all multiword moves.  */
+  return size > UNITS_PER_WORD;
 }
 
-/* Split a doubleword move from SRC to DEST.  On 32-bit targets,
-   this function handles 64-bit moves for which mips_split_64bit_move_p
-   holds.  For 64-bit targets, this function handles 128-bit moves.  */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds.
+   SPLIT_TYPE describes the split condition.  */
 
 void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src, enum mips_split_type split_type)
 {
   rtx low_dest;
 
+  gcc_checking_assert (mips_split_move_p (dest, src, split_type));
   if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
     {
       if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4206,6 +4257,41 @@ mips_split_doubleword_move (rtx dest, rt
 	}
     }
 }
+
+/* Return the split type for instruction INSN.  */
+
+static enum mips_split_type
+mips_insn_split_type (rtx insn)
+{
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  if (bb)
+    {
+      if (optimize_bb_for_speed_p (bb))
+	return SPLIT_FOR_SPEED;
+      else
+	return SPLIT_FOR_SIZE;
+    }
+  /* Once CFG information has been removed, we should trust the optimization
+     decisions made by previous passes and only split where necessary.  */
+  return SPLIT_IF_NECESSARY;
+}
+
+/* Return true if a move from SRC to DEST in INSN should be split.  */
+
+bool
+mips_split_move_insn_p (rtx dest, rtx src, rtx insn)
+{
+  return mips_split_move_p (dest, src, mips_insn_split_type (insn));
+}
+
+/* Split a move from SRC to DEST in INSN, given that mips_split_move_insn_p
+   holds.  */
+
+void
+mips_split_move_insn (rtx dest, rtx src, rtx insn)
+{
+  mips_split_move (dest, src, mips_insn_split_type (insn));
+}
 \f
 /* Return the appropriate instructions to move SRC into DEST.  Assume
    that SRC is operand 1 and DEST is operand 0.  */
@@ -4223,7 +4309,7 @@ mips_output_move (rtx dest, rtx src)
   mode = GET_MODE (dest);
   dbl_p = (GET_MODE_SIZE (mode) == 8);
 
-  if (dbl_p && mips_split_64bit_move_p (dest, src))
+  if (mips_split_move_p (dest, src, SPLIT_IF_NECESSARY))
     return "#";
 
   if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4234,6 +4320,14 @@ mips_output_move (rtx dest, rtx src)
 	  if (GP_REG_P (REGNO (dest)))
 	    return "move\t%0,%z1";
 
+	  if (mips_mult_move_p (dest, src, SPLIT_IF_NECESSARY))
+	    {
+	      if (ISA_HAS_DSP_MULT)
+		return "mult\t%q0,%.,%.";
+	      else
+		return "mult\t%.,%.";
+	    }
+
 	  /* Moves to HI are handled by special .md insns.  */
 	  if (REGNO (dest) == LO_REGNUM)
 	    return "mtlo\t%z1";
@@ -10444,10 +10538,7 @@ mips_save_reg (rtx reg, rtx mem)
     {
       rtx x1, x2;
 
-      if (mips_split_64bit_move_p (mem, reg))
-	mips_split_doubleword_move (mem, reg);
-      else
-	mips_emit_move (mem, reg);
+      mips_emit_move_or_split (mem, reg, SPLIT_IF_NECESSARY);
 
       x1 = mips_frame_set (mips_subword (mem, false),
 			   mips_subword (reg, false));
@@ -14907,10 +14998,15 @@ struct mips_sim {
 static void
 mips_sim_reset (struct mips_sim *state)
 {
+  curr_state = state->dfa_state;
+
   state->time = 0;
   state->insns_left = state->issue_rate;
   memset (&state->last_set, 0, sizeof (state->last_set));
-  state_reset (state->dfa_state);
+  state_reset (curr_state);
+
+  targetm.sched.init (0, false, 0);
+  advance_state (curr_state);
 }
 
 /* Initialize STATE before its first use.  DFA_STATE points to an
@@ -14919,6 +15015,12 @@ mips_sim_reset (struct mips_sim *state)
 static void
 mips_sim_init (struct mips_sim *state, state_t dfa_state)
 {
+  if (targetm.sched.init_dfa_pre_cycle_insn)
+    targetm.sched.init_dfa_pre_cycle_insn ();
+
+  if (targetm.sched.init_dfa_post_cycle_insn)
+    targetm.sched.init_dfa_post_cycle_insn ();
+
   state->issue_rate = mips_issue_rate ();
   state->dfa_state = dfa_state;
   mips_sim_reset (state);
@@ -14929,9 +15031,11 @@ mips_sim_init (struct mips_sim *state, s
 static void
 mips_sim_next_cycle (struct mips_sim *state)
 {
+  curr_state = state->dfa_state;
+
   state->time++;
   state->insns_left = state->issue_rate;
-  state_transition (state->dfa_state, 0);
+  advance_state (curr_state);
 }
 
 /* Advance simulation state STATE until instruction INSN can read
@@ -15037,8 +15141,11 @@ mips_sim_record_set (rtx x, const_rtx pa
 static void
 mips_sim_issue_insn (struct mips_sim *state, rtx insn)
 {
-  state_transition (state->dfa_state, insn);
-  state->insns_left--;
+  curr_state = state->dfa_state;
+
+  state_transition (curr_state, insn);
+  state->insns_left = targetm.sched.variable_issue (0, false, insn,
+						    state->insns_left);
 
   mips_sim_insn = insn;
   note_stores (PATTERN (insn), mips_sim_record_set, state);
@@ -15089,6 +15196,109 @@ mips_sim_finish_insn (struct mips_sim *s
       break;
     }
 }
+
+/* Use simulator state STATE to calculate the execution time of
+   instruction sequence SEQ.  */
+
+static unsigned int
+mips_seq_time (struct mips_sim *state, rtx seq)
+{
+  mips_sim_reset (state);
+  for (rtx insn = seq; insn; insn = NEXT_INSN (insn))
+    {
+      mips_sim_wait_insn (state, insn);
+      mips_sim_issue_insn (state, insn);
+    }
+  return state->time;
+}
+\f
+/* Return the execution-time cost of mips_tuning_info.fast_mult_zero_zero_p
+   setting SETTING, using STATE to simulate instruction sequences.  */
+
+static unsigned int
+mips_mult_zero_zero_cost (struct mips_sim *state, bool setting)
+{
+  mips_tuning_info.fast_mult_zero_zero_p = setting;
+  start_sequence ();
+
+  enum machine_mode dword_mode = TARGET_64BIT ? TImode : DImode;
+  rtx hilo = gen_rtx_REG (dword_mode, MD_REG_FIRST);
+  mips_emit_move_or_split (hilo, const0_rtx, SPLIT_FOR_SPEED);
+
+  /* If the target provides mulsidi3_32bit then that's the most likely
+     consumer of the result.  Test for bypasses.  */
+  if (dword_mode == DImode && HAVE_maddsidi4)
+    {
+      rtx gpr = gen_rtx_REG (SImode, GP_REG_FIRST + 4);
+      emit_insn (gen_maddsidi4 (hilo, gpr, gpr, hilo));
+    }
+
+  unsigned int time = mips_seq_time (state, get_insns ());
+  end_sequence ();
+  return time;
+}
+
+/* Check the relative speeds of "MULT $0,$0" and "MTLO $0; MTHI $0"
+   and set up mips_tuning_info.fast_mult_zero_zero_p accordingly.
+   Prefer MULT -- which is shorter -- in the event of a tie.  */
+
+static void
+mips_set_fast_mult_zero_zero_p (struct mips_sim *state)
+{
+  if (TARGET_MIPS16)
+    /* No MTLO or MTHI available.  */
+    mips_tuning_info.fast_mult_zero_zero_p = true;
+  else
+    {
+      unsigned int true_time = mips_mult_zero_zero_cost (state, true);
+      unsigned int false_time = mips_mult_zero_zero_cost (state, false);
+      mips_tuning_info.fast_mult_zero_zero_p = (true_time <= false_time);
+    }
+}
+
+/* Set up costs based on the current architecture and tuning settings.  */
+
+static void
+mips_set_tuning_info (void)
+{
+  if (mips_tuning_info.initialized_p
+      && mips_tuning_info.arch == mips_arch
+      && mips_tuning_info.tune == mips_tune
+      && mips_tuning_info.mips16_p == TARGET_MIPS16)
+    return;
+
+  mips_tuning_info.arch = mips_arch;
+  mips_tuning_info.tune = mips_tune;
+  mips_tuning_info.mips16_p = TARGET_MIPS16;
+  mips_tuning_info.initialized_p = true;
+
+  dfa_start ();
+
+  struct mips_sim state;
+  mips_sim_init (&state, alloca (state_size ()));
+
+  mips_set_fast_mult_zero_zero_p (&state);
+
+  dfa_finish ();
+}
+
+/* Implement TARGET_EXPAND_TO_RTL_HOOK.  */
+
+static void
+mips_expand_to_rtl_hook (void)
+{
+  /* We need to call this at a point where we can safely create sequences
+     of instructions, so TARGET_OVERRIDE_OPTIONS is too early.  We also
+     need to call it at a point where the DFA infrastructure is not
+     already in use, so we can't just call it lazily on demand.
+
+     At present, mips_tuning_info is only needed during post-expand
+     RTL passes such as split_insns, so this hook should be early enough.
+     We may need to move the call elsewhere if mips_tuning_info starts
+     to be used for other things (such as rtx_costs, or expanders that
+     could be called during gimple optimization).  */
+  mips_set_tuning_info ();
+}
 \f
 /* The VR4130 pipeline issues aligned pairs of instructions together,
    but it stalls the second instruction if it depends on the first.
@@ -17760,6 +17970,8 @@ #define TARGET_MACHINE_DEPENDENT_REORG m
 #undef  TARGET_PREFERRED_RELOAD_CLASS
 #define TARGET_PREFERRED_RELOAD_CLASS mips_preferred_reload_class
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK mips_expand_to_rtl_hook
 #undef TARGET_ASM_FILE_START
 #define TARGET_ASM_FILE_START mips_file_start
 #undef TARGET_ASM_FILE_START_FILE_DIRECTIVE
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md	2012-10-06 21:08:44.902431555 +0100
+++ gcc/config/mips/mips.md	2012-10-06 22:24:24.854713317 +0100
@@ -204,7 +204,7 @@ (define_attr "jal_macro" "no,yes"
 ;; the split instructions; in some cases, it is more appropriate for the
 ;; scheduling type to be "multi" instead.
 (define_attr "move_type"
-  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
    const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
    shift_shift"
   (const_string "unknown"))
@@ -369,6 +369,7 @@ (define_attr "type"
 	 (eq_attr "move_type" "mflo") (const_string "mflo")
 
 	 ;; These types of move are always single insns.
+	 (eq_attr "move_type" "imul") (const_string "imul")
 	 (eq_attr "move_type" "fmove") (const_string "fmove")
 	 (eq_attr "move_type" "loadpool") (const_string "load")
 	 (eq_attr "move_type" "signext") (const_string "signext")
@@ -4242,14 +4243,17 @@ (define_insn "*mov<mode>_ra"
    (set_attr "mode" "<MODE>")])
 
 (define_insn "*movdi_32bit"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
-	(match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+	(match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
   "!TARGET_64BIT && !TARGET_MIPS16
    && (register_operand (operands[0], DImode)
        || reg_or_0_operand (operands[1], DImode))"
   { return mips_output_move (operands[0], operands[1]); }
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
-   (set_attr "mode" "DI")])
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+   (set (attr "mode")
+   	(if_then_else (eq_attr "move_type" "imul")
+		      (const_string "SI")
+		      (const_string "DI")))])
 
 (define_insn "*movdi_32bit_mips16"
   [(set (match_operand:DI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4695,15 +4699,18 @@ (define_expand "movti"
 })
 
 (define_insn "*movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
-	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
   "TARGET_64BIT
    && !TARGET_MIPS16
    && (register_operand (operands[0], TImode)
        || reg_or_0_operand (operands[1], TImode))"
-  "#"
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
-   (set_attr "mode" "TI")])
+  { return mips_output_move (operands[0], operands[1]); }
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
+   (set (attr "mode")
+   	(if_then_else (eq_attr "move_type" "imul")
+		      (const_string "SI")
+		      (const_string "TI")))])
 
 (define_insn "*movti_mips16"
   [(set (match_operand:TI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4753,21 +4760,20 @@ (define_insn "*movtf_mips16"
 (define_split
   [(set (match_operand:MOVE64 0 "nonimmediate_operand")
 	(match_operand:MOVE64 1 "move_operand"))]
-  "reload_completed && !TARGET_64BIT
-   && mips_split_64bit_move_p (operands[0], operands[1])"
+  "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move_insn (operands[0], operands[1], curr_insn);
   DONE;
 })
 
 (define_split
   [(set (match_operand:MOVE128 0 "nonimmediate_operand")
 	(match_operand:MOVE128 1 "move_operand"))]
-  "TARGET_64BIT && reload_completed"
+  "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move_insn (operands[0], operands[1], curr_insn);
   DONE;
 })
 
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c	2012-10-06 17:26:43.407618293 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c	2012-10-06 22:24:24.855713317 +0100
@@ -1,7 +1,13 @@
 /* { dg-do compile } */
-/* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* { dg-options "isa_rev>=1 -mgp32 -mtune=4kc" } */
+/* References to X within the loop need to have a higher frequency than
+   references to X outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 /* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
 /* { dg-final { scan-assembler "\tmadd\t" } } */
 
 NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c
===================================================================
--- /dev/null	2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c	2012-10-06 22:24:24.856713316 +0100
@@ -0,0 +1,22 @@
+/* { dg-options "-mdspr2 -mgp32 -mtune=74kc" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */
+/* { dg-final { scan-assembler-not "mthi\t" } } */
+/* { dg-final { scan-assembler-not "mtlo\t" } } */
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- /dev/null	2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
@@ -0,0 +1,22 @@
+/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
+/* { dg-final { scan-assembler "\tmthi\t" } } */
+/* { dg-final { scan-assembler "\tmtlo\t" } } */

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-10-07  8:45                       ` Richard Sandiford
@ 2012-10-08 23:23                         ` Maciej W. Rozycki
  2012-10-10 20:00                           ` Richard Sandiford
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-10-08 23:23 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches

On Sun, 7 Oct 2012, Richard Sandiford wrote:

> >  So I think this can't really be selected automatically for all cores, 
> > some human-supplied knowledge about the MD unit used is required -- that 
> > obviously affects other operations too, e.g. some multiplications 
> > involving a constant that may be cheaper to do either directly or with a 
> > sequence of additions depending on the MD unit present (unless optimising 
> > for size, of course).
> 
> Yeah, as far as this optimisation goes, I think your original suggestion
> of using the DFA to work out the best sequence is still right.  If the DFA
> says that multiplications take a handful of cycles when they actually take
> 32 then we're not going to optimise multiplication very well whatever happens.
> 
> In practice, code destined for non-4kc cores with iterative multipliers
> would probably tune well with -mtune=4kp.

 Hmm, maybe, although I find requiring people to tune for an obsolete 
MIPS32 rev. 1 processor to get the right MDU performance figures for 
modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations 
for newer instructions introduced with the MIPS32r2 architecture update).

 I've thought of -miterative-mdu or suchlike a knob that would override 
the default cost of multiplication/division as appropriate (i.e. to 32/64 
plus any extra operation-specific constant as required), perhaps by 
forcing the 4Kp insn reservation (extended to 64 bits as required) over 
the usual one; of course there would be nothing to override the -mtune=4Kp 
arrangement with.  Nothing urgent though, just an idea to ponder.

> The final patch ended up being much more complicated than I'd have liked,
> but at least it should be easier to add other automatically-derived tuning
> costs in future.

 Thanks for taking my concerns into account.  One nit below.

> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
> ===================================================================
> --- /dev/null	2012-10-06 09:26:21.400998949 +0100
> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
> @@ -0,0 +1,22 @@
> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
> +/* References to RESULT within the loop need to have a higher frequency than
> +   references to RESULT outside the loop, otherwise there is no reason
> +   to prefer multiply/accumulator registers over GPRs.  */
> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
> +
> +/* Check that the zero-initialization of the accumulator feeding into
> +   the madd is done by means of a mult instruction instead of mthi/mtlo.  */

 This comment looks reversed to me as in this case we do NOT want a MULT 
to be used...

> +
> +NOMIPS16 long long f (int n, int *v, int m)
> +{
> +  long long result = 0;
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    result = __builtin_mips_madd (result, v[i], m);
> +  return result;
> +}
> +
> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
> +/* { dg-final { scan-assembler "\tmthi\t" } } */
> +/* { dg-final { scan-assembler "\tmtlo\t" } } */

 ... and in fact you do check that we didn't get one.

  Maciej

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-10-08 23:23                         ` Maciej W. Rozycki
@ 2012-10-10 20:00                           ` Richard Sandiford
  2012-10-11 15:29                             ` Maciej W. Rozycki
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-10-10 20:00 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches

"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Sun, 7 Oct 2012, Richard Sandiford wrote:
>
>> >  So I think this can't really be selected automatically for all cores, 
>> > some human-supplied knowledge about the MD unit used is required -- that 
>> > obviously affects other operations too, e.g. some multiplications 
>> > involving a constant that may be cheaper to do either directly or with a 
>> > sequence of additions depending on the MD unit present (unless optimising 
>> > for size, of course).
>> 
>> Yeah, as far as this optimisation goes, I think your original suggestion
>> of using the DFA to work out the best sequence is still right.  If the DFA
>> says that multiplications take a handful of cycles when they actually take
>> 32 then we're not going to optimise multiplication very well whatever happens.
>> 
>> In practice, code destined for non-4kc cores with iterative multipliers
>> would probably tune well with -mtune=4kp.
>
>  Hmm, maybe, although I find requiring people to tune for an obsolete 
> MIPS32 rev. 1 processor to get the right MDU performance figures for 
> modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations 
> for newer instructions introduced with the MIPS32r2 architecture update).
>
>  I've thought of -miterative-mdu or suchlike a knob that would override 
> the default cost of multiplication/division as appropriate (i.e. to 32/64 
> plus any extra operation-specific constant as required), perhaps by 
> forcing the 4Kp insn reservation (extended to 64 bits as required) over 
> the usual one; of course there would be nothing to override the -mtune=4Kp 
> arrangement with.  Nothing urgent though, just an idea to ponder.

FWIW, using the 4Kp reservation itself wouldn't necessarily work,
because all modern scheduling descriptions define their own CPU units.
You'd probably need to add new reservations to each affected MIPS32
and MIPS64 .md file.

>> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
>> ===================================================================
>> --- /dev/null	2012-10-06 09:26:21.400998949 +0100
>> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
>> @@ -0,0 +1,22 @@
>> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
>> +/* References to RESULT within the loop need to have a higher frequency than
>> +   references to RESULT outside the loop, otherwise there is no reason
>> +   to prefer multiply/accumulator registers over GPRs.  */
>> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
>> +
>> +/* Check that the zero-initialization of the accumulator feeding into
>> +   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
>
>  This comment looks reversed to me as in this case we do NOT want a MULT 
> to be used...
>
>> +
>> +NOMIPS16 long long f (int n, int *v, int m)
>> +{
>> +  long long result = 0;
>> +  int i;
>> +
>> +  for (i = 0; i < n; i++)
>> +    result = __builtin_mips_madd (result, v[i], m);
>> +  return result;
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
>> +/* { dg-final { scan-assembler "\tmthi\t" } } */
>> +/* { dg-final { scan-assembler "\tmtlo\t" } } */
>
>  ... and in fact you do check that we didn't get one.

Thanks, fixed as follows.

Richard


gcc/testsuite/
	* gcc.target/mips/mips32-dsp-accinit-2.c: Fix test description.

Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-07 09:45:04.000000000 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-10 20:51:44.402528138 +0100
@@ -5,7 +5,8 @@
 /* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 
 /* Check that the zero-initialization of the accumulator feeding into
-   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+   the madd is done by means of an mthi & mtlo pair instead of a
+   "mult $0,$0" instruction.  */
 
 NOMIPS16 long long f (int n, int *v, int m)
 {

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
  2012-10-10 20:00                           ` Richard Sandiford
@ 2012-10-11 15:29                             ` Maciej W. Rozycki
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-10-11 15:29 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches

On Wed, 10 Oct 2012, Richard Sandiford wrote:

> >  I've thought of -miterative-mdu or suchlike a knob that would override 
> > the default cost of multiplication/division as appropriate (i.e. to 32/64 
> > plus any extra operation-specific constant as required), perhaps by 
> > forcing the 4Kp insn reservation (extended to 64 bits as required) over 
> > the usual one; of course there would be nothing to override the -mtune=4Kp 
> > arrangement with.  Nothing urgent though, just an idea to ponder.
> 
> FWIW, using the 4Kp reservation itself wouldn't necessarily work,
> because all modern scheduling descriptions define their own CPU units.
> You'd probably need to add new reservations to each affected MIPS32
> and MIPS64 .md file.

 Thanks for the hint.  I have skimmed through documentation I have 
available and the following MTI processors implement an iterative MDU as 
an RTL option: M4K, M14K, M14Kc, and the following ones always have one: 
4Kp, 4KEp.  No 64-bit MTI processor had it so far it would seem (the 5K 
chips have an iterative divider and a high-performance multiplier).

 We currently have no pipeline descriptions for any of the M cores, so my 
concern seemed to be a little bit premature (though still valid of 
course).  The option proposed will only be needed once/if we get any of 
the respective pipeline descriptions.  Arguably anyone targetting them is 
likely going to be after -Os anyway, although at least with the M14Kc I 
think it should not be taken for granted.

 We're missing a pipeline description for the 4KE-family cores as well.  
If not adding costs for the extra MIPS32r2 additions I think at the very 
least we should trivially enable the existing bits of 4k.md for these 
cores; the pipeline is the same across the whole 4K/4KE family and it's 
only the extra instructions that have been added with the E update.

  Maciej

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-10-11 15:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-16 15:13 [PATCH, MIPS] add new peephole for 74k dspr2 Sandra Loosemore
2012-08-19 17:23 ` Richard Sandiford
2012-08-22 14:26   ` Sandra Loosemore
2012-08-27 16:37     ` Richard Sandiford
2012-09-18 17:18       ` PING " Sandra Loosemore
2012-09-18 18:17         ` Richard Sandiford
2012-09-24 15:49           ` Maciej W. Rozycki
2012-09-24 21:40             ` Richard Sandiford
2012-09-25  0:52               ` Maciej W. Rozycki
2012-09-25  8:38                 ` Richard Sandiford
2012-09-25 10:29                   ` Richard Sandiford
2012-09-25 18:06                     ` Maciej W. Rozycki
2012-10-07  8:45                       ` Richard Sandiford
2012-10-08 23:23                         ` Maciej W. Rozycki
2012-10-10 20:00                           ` Richard Sandiford
2012-10-11 15:29                             ` Maciej W. Rozycki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).