* [PATCH, MIPS] add new peephole for 74k dspr2
@ 2012-08-16 15:13 Sandra Loosemore
2012-08-19 17:23 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-08-16 15:13 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 558 bytes --]
This patch adds a peephole optimization to use a clever trick to
zero-initialize the two halves of an accumulator register with one
instruction instead of a mtlo/mthi pair. OK to check in?
-Sandra
2012-08-16 Sandra Loosemore <sandra@codesourcery.com>
Julian Brown <julian@codesourcery.com>
MIPS Technologies, Inc.
gcc/
* config/mips/mips-dspr2.md (UNSPEC_ACC_INIT): Declare.
(mult peephole2): Add peephole that converts
"mtlo $ac[1-3],$0; mthi $ac[1-3],$0" into
"mult $ac[1-3],$0,$0".
(*mips_acc_init): New insn for above.
[-- Attachment #2: peep.patch --]
[-- Type: text/x-patch, Size: 1496 bytes --]
Index: gcc/config/mips/mips-dspr2.md
===================================================================
--- gcc/config/mips/mips-dspr2.md (revision 190437)
+++ gcc/config/mips/mips-dspr2.md (working copy)
@@ -68,6 +68,7 @@
UNSPEC_DPAQX_SA_W_PH
UNSPEC_DPSQX_S_W_PH
UNSPEC_DPSQX_SA_W_PH
+ UNSPEC_ACC_INIT
])
(define_insn "mips_absq_s_qb"
@@ -630,3 +631,33 @@
[(set_attr "type" "dspmacsat")
(set_attr "accum_in" "1")
(set_attr "mode" "SI")])
+
+;; Convert mtlo $ac[1-3],$0 => mult $ac[1-3],$0,$0
+;; mthi $ac[1-3],$0
+(define_peephole2
+ [(set (match_operand:SI 0 "register_operand" "")
+ (const_int 0))
+ (set (match_operand:SI 1 "register_operand" "")
+ (const_int 0))]
+ "ISA_HAS_DSPR2
+ && !TARGET_MIPS16
+ && !TARGET_64BIT
+ && true_regnum (operands[0]) >= DSP_ACC_REG_FIRST
+ && true_regnum (operands[0]) <= DSP_ACC_REG_LAST
+ && true_regnum (operands[0]) / 2 == true_regnum (operands[1]) / 2"
+ [(parallel [(set (match_dup 0) (const_int 0))
+ (set (match_dup 1) (const_int 0))
+ (unspec [(const_int 0)] UNSPEC_ACC_INIT)])]
+)
+
+(define_insn "*mips_acc_init"
+ [(parallel
+ [(set (match_operand:SI 0 "register_operand" "=a") (const_int 0))
+ (set (match_operand:SI 1 "register_operand" "=a") (const_int 0))
+ (unspec [(const_int 0)] UNSPEC_ACC_INIT)])]
+ "ISA_HAS_DSPR2
+ && !TARGET_MIPS16
+ && !TARGET_64BIT"
+ "mult\t%q0,$0,$0\t\t# Clear ACC HI/LO"
+ [(set_attr "type" "imul")
+ (set_attr "mode" "SI")])
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-08-16 15:13 [PATCH, MIPS] add new peephole for 74k dspr2 Sandra Loosemore
@ 2012-08-19 17:23 ` Richard Sandiford
2012-08-22 14:26 ` Sandra Loosemore
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-08-19 17:23 UTC (permalink / raw)
To: Sandra Loosemore; +Cc: gcc-patches
Sandra Loosemore <sandra@codesourcery.com> writes:
> This patch adds a peephole optimization to use a clever trick to
> zero-initialize the two halves of an accumulator register with one
> instruction instead of a mtlo/mthi pair. OK to check in?
>
> -Sandra
>
> 2012-08-16 Sandra Loosemore <sandra@codesourcery.com>
> Julian Brown <julian@codesourcery.com>
> MIPS Technologies, Inc.
>
> gcc/
> * config/mips/mips-dspr2.md (UNSPEC_ACC_INIT): Declare.
> (mult peephole2): Add peephole that converts
> "mtlo $ac[1-3],$0; mthi $ac[1-3],$0" into
> "mult $ac[1-3],$0,$0".
> (*mips_acc_init): New insn for above.
Not sure whether a peephole is the right choice here. In practice,
I'd imagine these opportunities would only come from a DImode move of
$0 into a doubleword register, so we could simply emit the pattern in
mips_split_doubleword_move.
That would also allow us to use it for plain HI and LO. It wasn't
obvious from the patch why it was restricted to the DSP extension
registers.
Please also add a scan-assembler test.
Richard
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-08-19 17:23 ` Richard Sandiford
@ 2012-08-22 14:26 ` Sandra Loosemore
2012-08-27 16:37 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-08-22 14:26 UTC (permalink / raw)
To: gcc-patches, rdsandiford
[-- Attachment #1: Type: text/plain, Size: 756 bytes --]
On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>
> Not sure whether a peephole is the right choice here. In practice,
> I'd imagine these opportunities would only come from a DImode move of
> $0 into a doubleword register, so we could simply emit the pattern in
> mips_split_doubleword_move.
>
> That would also allow us to use it for plain HI and LO. It wasn't
> obvious from the patch why it was restricted to the DSP extension
> registers.
>
> Please also add a scan-assembler test.
How is this version of the fix?
-Sandra
2012-08-22 Sandra Loosemore <sandra@codesourcery.com>
gcc/
* mips.c (mips_split_doubleword_move): Use mult instruction to
zero-initialize accumulator.
gcc/testsuite/
* gcc.target/mips/mips32-dsp-accinit.c: New.
[-- Attachment #2: accinit.patch --]
[-- Type: text/x-patch, Size: 1524 bytes --]
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c (revision 190463)
+++ gcc/config/mips/mips.c (working copy)
@@ -4158,6 +4158,14 @@ mips_split_doubleword_move (rtx dest, rt
else
emit_insn (gen_mfhisi_di (mips_subword (dest, true), src));
}
+ else if (!TARGET_64BIT && !TARGET_MIPS16 && ISA_HAS_DSP_MULT
+ && src == const0_rtx
+ && REG_P (dest) && ACC_REG_P (REGNO (dest)))
+ /* Zero-initialize accumulator using "mult $dest,$0,$0" instead
+ of a mthi/mtlo pair. */
+ emit_insn (gen_mulsidi3_32bit (dest,
+ gen_rtx_REG (SImode, GP_REG_FIRST),
+ gen_rtx_REG (SImode, GP_REG_FIRST)));
else
{
/* The operation can be split into two normal moves. Decide in
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c
===================================================================
--- gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c (revision 0)
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c (revision 0)
@@ -0,0 +1,16 @@
+/* { dg-options "-O2 -march=74kc -mgp32" } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+ the madd is done by means of a mult instruction instead of mthi/mtlo. */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+ long long result = 0;
+ int i;
+
+ for (i = 0; i < n; i++)
+ result = __builtin_mips_madd (result, v[i], m);
+ return result;
+}
+
+/* { dg-final { scan-assembler "mult\t\\\$ac.,\\\$0,\\\$0" } } */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-08-22 14:26 ` Sandra Loosemore
@ 2012-08-27 16:37 ` Richard Sandiford
2012-09-18 17:18 ` PING " Sandra Loosemore
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-08-27 16:37 UTC (permalink / raw)
To: Sandra Loosemore; +Cc: gcc-patches
Sandra Loosemore <sandra@codesourcery.com> writes:
> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>
>> Not sure whether a peephole is the right choice here. In practice,
>> I'd imagine these opportunities would only come from a DImode move of
>> $0 into a doubleword register, so we could simply emit the pattern in
>> mips_split_doubleword_move.
>>
>> That would also allow us to use it for plain HI and LO. It wasn't
>> obvious from the patch why it was restricted to the DSP extension
>> registers.
>>
>> Please also add a scan-assembler test.
>
> How is this version of the fix?
Just to say that I've not forgotten about this. I'd still like to
remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
idea isn't specific to either.
Also, reviewing the patch made me realise that it might be better
to keep the move intact and simply use "mult" in the output code.
That's my fault for suggesting the wrong thing, though, so I was
hoping to find time this weekend to try it myself. The testsuite
stuff ended up taking up all the available time instead.
Richard
^ permalink raw reply [flat|nested] 16+ messages in thread
* PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-08-27 16:37 ` Richard Sandiford
@ 2012-09-18 17:18 ` Sandra Loosemore
2012-09-18 18:17 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Sandra Loosemore @ 2012-09-18 17:18 UTC (permalink / raw)
To: gcc-patches, rdsandiford
On 08/27/2012 10:36 AM, Richard Sandiford wrote:
> Sandra Loosemore<sandra@codesourcery.com> writes:
>> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>>
>>> Not sure whether a peephole is the right choice here. In practice,
>>> I'd imagine these opportunities would only come from a DImode move of
>>> $0 into a doubleword register, so we could simply emit the pattern in
>>> mips_split_doubleword_move.
>>>
>>> That would also allow us to use it for plain HI and LO. It wasn't
>>> obvious from the patch why it was restricted to the DSP extension
>>> registers.
>>>
>>> Please also add a scan-assembler test.
>>
>> How is this version of the fix?
>
> Just to say that I've not forgotten about this. I'd still like to
> remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
> idea isn't specific to either.
>
> Also, reviewing the patch made me realise that it might be better
> to keep the move intact and simply use "mult" in the output code.
> That's my fault for suggesting the wrong thing, though, so I was
> hoping to find time this weekend to try it myself. The testsuite
> stuff ended up taking up all the available time instead.
Richard,
Have you had time to think about this some more? I am not sure I can
guess how you'd like me to fix this patch now without some more specific
review and/or suggestions about where the optimization should happen and
what cases it should be extended to detect in addition to the dsp
accumulator multiplies.
-Sandra
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-18 17:18 ` PING " Sandra Loosemore
@ 2012-09-18 18:17 ` Richard Sandiford
2012-09-24 15:49 ` Maciej W. Rozycki
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-18 18:17 UTC (permalink / raw)
To: Sandra Loosemore; +Cc: gcc-patches
Sandra Loosemore <sandra@codesourcery.com> writes:
> On 08/27/2012 10:36 AM, Richard Sandiford wrote:
>> Sandra Loosemore<sandra@codesourcery.com> writes:
>>> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>>>
>>>> Not sure whether a peephole is the right choice here. In practice,
>>>> I'd imagine these opportunities would only come from a DImode move of
>>>> $0 into a doubleword register, so we could simply emit the pattern in
>>>> mips_split_doubleword_move.
>>>>
>>>> That would also allow us to use it for plain HI and LO. It wasn't
>>>> obvious from the patch why it was restricted to the DSP extension
>>>> registers.
>>>>
>>>> Please also add a scan-assembler test.
>>>
>>> How is this version of the fix?
>>
>> Just to say that I've not forgotten about this. I'd still like to
>> remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
>> idea isn't specific to either.
>>
>> Also, reviewing the patch made me realise that it might be better
>> to keep the move intact and simply use "mult" in the output code.
>> That's my fault for suggesting the wrong thing, though, so I was
>> hoping to find time this weekend to try it myself. The testsuite
>> stuff ended up taking up all the available time instead.
>
> Richard,
>
> Have you had time to think about this some more? I am not sure I can
> guess how you'd like me to fix this patch now without some more specific
> review and/or suggestions about where the optimization should happen and
> what cases it should be extended to detect in addition to the dsp
> accumulator multiplies.
The patch below is the one I've been testing. But I got sidetracked
by looking into the possibility of removing the MD0_REG and MD1_REG
classes, in order to get more sensible costs. I think that was needed
for the madd-9.c test to pass.
Richard
Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h 2012-09-03 07:49:57.319188985 +0100
+++ gcc/config/mips/mips-protos.h 2012-09-04 20:15:10.240130458 +0100
@@ -212,8 +212,8 @@ extern int m16_simm8_8 (rtx, enum machin
extern int m16_nsimm8_8 (rtx, enum machine_mode);
extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx);
+extern void mips_split_move (rtx, rtx);
extern const char *mips_output_move (rtx, rtx);
extern bool mips_cfun_has_cprestore_slot_p (void);
extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c 2012-09-04 20:15:08.191130518 +0100
+++ gcc/config/mips/mips.c 2012-09-04 20:15:17.173130256 +0100
@@ -2395,11 +2395,11 @@ mips_load_store_insns (rtx mem, rtx insn
mode = GET_MODE (mem);
/* Try to prove that INSN does not need to be split. */
- might_split_p = true;
- if (GET_MODE_BITSIZE (mode) == 64)
+ might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+ if (might_split_p)
{
set = single_set (insn);
- if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+ if (set && !mips_split_move_p (SET_DEST (set), SET_SRC (set)))
might_split_p = false;
}
@@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
return simplify_gen_subreg (word_mode, op, mode, byte);
}
-/* Return true if a 64-bit move from SRC to DEST should be split into two. */
+/* Return true if SRC can be moved into DEST using MULT $0, $0. */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src)
+{
+ return (src == const0_rtx
+ && REG_P (dest)
+ && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+ && (ISA_HAS_DSP_MULT
+ ? ACC_REG_P (REGNO (dest))
+ : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two. */
bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src)
{
- if (TARGET_64BIT)
+ /* Check whether the move can be done using some variant of MULT $0,$0. */
+ if (mips_mult_move_p (dest, src))
return false;
/* FPR-to-FPR moves can be done in a single instruction, if they're
allowed at all. */
- if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+ unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+ if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
return false;
/* Check for floating-point loads and stores. */
- if (ISA_HAS_LDC1_SDC1)
+ if (size == 8 && ISA_HAS_LDC1_SDC1)
{
if (FP_REG_RTX_P (dest) && MEM_P (src))
return false;
if (FP_REG_RTX_P (src) && MEM_P (dest))
return false;
}
- return true;
+
+ /* Otherwise split all multiword moves. */
+ return size > UNITS_PER_WORD;
}
-/* Split a doubleword move from SRC to DEST. On 32-bit targets,
- this function handles 64-bit moves for which mips_split_64bit_move_p
- holds. For 64-bit targets, this function handles 128-bit moves. */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds. */
void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src)
{
rtx low_dest;
+ gcc_assert (mips_split_move_p (dest, src));
if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
{
if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4209,7 +4225,7 @@ mips_output_move (rtx dest, rtx src)
mode = GET_MODE (dest);
dbl_p = (GET_MODE_SIZE (mode) == 8);
- if (dbl_p && mips_split_64bit_move_p (dest, src))
+ if (mips_split_move_p (dest, src))
return "#";
if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4220,6 +4236,14 @@ mips_output_move (rtx dest, rtx src)
if (GP_REG_P (REGNO (dest)))
return "move\t%0,%z1";
+ if (mips_mult_move_p (dest, src))
+ {
+ if (ISA_HAS_DSP_MULT)
+ return "mult\t%q0,%.,%.";
+ else
+ return "mult\t%.,%.";
+ }
+
/* Moves to HI are handled by special .md insns. */
if (REGNO (dest) == LO_REGNUM)
return "mtlo\t%z1";
@@ -10430,8 +10454,8 @@ mips_save_reg (rtx reg, rtx mem)
{
rtx x1, x2;
- if (mips_split_64bit_move_p (mem, reg))
- mips_split_doubleword_move (mem, reg);
+ if (mips_split_move_p (mem, reg))
+ mips_split_move (mem, reg);
else
mips_emit_move (mem, reg);
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md 2012-09-03 07:49:57.318188985 +0100
+++ gcc/config/mips/mips.md 2012-09-04 20:15:10.293130456 +0100
@@ -204,7 +204,7 @@ (define_attr "jal_macro" "no,yes"
;; the split instructions; in some cases, it is more appropriate for the
;; scheduling type to be "multi" instead.
(define_attr "move_type"
- "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+ "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
shift_shift"
(const_string "unknown"))
@@ -369,6 +369,7 @@ (define_attr "type"
(eq_attr "move_type" "mflo") (const_string "mflo")
;; These types of move are always single insns.
+ (eq_attr "move_type" "imul") (const_string "imul")
(eq_attr "move_type" "fmove") (const_string "fmove")
(eq_attr "move_type" "loadpool") (const_string "load")
(eq_attr "move_type" "signext") (const_string "signext")
@@ -4254,13 +4255,13 @@ (define_insn "*mov<mode>_ra"
(set_attr "mode" "<MODE>")])
(define_insn "*movdi_32bit"
- [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
- (match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+ [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+ (match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
"!TARGET_64BIT && !TARGET_MIPS16
&& (register_operand (operands[0], DImode)
|| reg_or_0_operand (operands[1], DImode))"
{ return mips_output_move (operands[0], operands[1]); }
- [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+ [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
(set_attr "mode" "DI")])
(define_insn "*movdi_32bit_mips16"
@@ -4707,14 +4708,14 @@ (define_expand "movti"
})
(define_insn "*movti"
- [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
- (match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+ [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+ (match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
"TARGET_64BIT
&& !TARGET_MIPS16
&& (register_operand (operands[0], TImode)
|| reg_or_0_operand (operands[1], TImode))"
- "#"
- [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
+ { return mips_output_move (operands[0], operands[1]); }
+ [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
(set_attr "mode" "TI")])
(define_insn "*movti_mips16"
@@ -4765,21 +4766,20 @@ (define_insn "*movtf_mips16"
(define_split
[(set (match_operand:MOVE64 0 "nonimmediate_operand")
(match_operand:MOVE64 1 "move_operand"))]
- "reload_completed && !TARGET_64BIT
- && mips_split_64bit_move_p (operands[0], operands[1])"
+ "reload_completed && mips_split_move_p (operands[0], operands[1])"
[(const_int 0)]
{
- mips_split_doubleword_move (operands[0], operands[1]);
+ mips_split_move (operands[0], operands[1]);
DONE;
})
(define_split
[(set (match_operand:MOVE128 0 "nonimmediate_operand")
(match_operand:MOVE128 1 "move_operand"))]
- "TARGET_64BIT && reload_completed"
+ "reload_completed && mips_split_move_p (operands[0], operands[1])"
[(const_int 0)]
{
- mips_split_doubleword_move (operands[0], operands[1]);
+ mips_split_move (operands[0], operands[1]);
DONE;
})
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c 2012-09-03 07:49:57.318188985 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c 2012-09-04 21:21:17.132015119 +0100
@@ -1,7 +1,13 @@
/* { dg-do compile } */
/* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* References to X within the loop need to have a higher frequency than
+ references to X outside the loop, otherwise there is no reason
+ to prefer multiply/accumulator registers over GPRs. */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
/* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
/* { dg-final { scan-assembler "\tmadd\t" } } */
NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c
===================================================================
--- /dev/null 2012-08-19 20:42:12.842999468 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c 2012-09-04 21:04:52.697043742 +0100
@@ -0,0 +1,20 @@
+/* { dg-options "-mdspr2 -mgp32" } */
+/* References to RESULT within the loop need to have a higher frequency than
+ references to RESULT outside the loop, otherwise there is no reason
+ to prefer multiply/accumulator registers over GPRs. */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+ the madd is done by means of a mult instruction instead of mthi/mtlo. */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+ long long result = 0;
+ int i;
+
+ for (i = 0; i < n; i++)
+ result = __builtin_mips_madd (result, v[i], m);
+ return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-18 18:17 ` Richard Sandiford
@ 2012-09-24 15:49 ` Maciej W. Rozycki
2012-09-24 21:40 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-24 15:49 UTC (permalink / raw)
To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches
On Tue, 18 Sep 2012, Richard Sandiford wrote:
> > Have you had time to think about this some more? I am not sure I can
> > guess how you'd like me to fix this patch now without some more specific
> > review and/or suggestions about where the optimization should happen and
> > what cases it should be extended to detect in addition to the dsp
> > accumulator multiplies.
>
> The patch below is the one I've been testing. But I got sidetracked
> by looking into the possibility of removing the MD0_REG and MD1_REG
> classes, in order to get more sensible costs. I think that was needed
> for the madd-9.c test to pass.
Sorry to come up with this so late -- I have only now noticed this being
discussed.
> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
> return simplify_gen_subreg (word_mode, op, mode, byte);
> }
>
> -/* Return true if a 64-bit move from SRC to DEST should be split into two. */
> +/* Return true if SRC can be moved into DEST using MULT $0, $0. */
> +
> +static bool
> +mips_mult_move_p (rtx dest, rtx src)
> +{
> + return (src == const0_rtx
> + && REG_P (dest)
> + && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
> + && (ISA_HAS_DSP_MULT
> + ? ACC_REG_P (REGNO (dest))
> + : MD_REG_P (REGNO (dest))));
> +}
> +
> +/* Return true if a move from SRC to DEST should be split into two. */
Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower
than MTHI $0/MTLO $0? The latency of multiplication varies among
implementations, for example the original R3000 took 12 cycles (of course
the R3000 itself is not relevant for this change, but you see the
picture!). On the other hand in some (but not all!) processors
multiplication runs in parallel to the main pipeline so it is the
difference, if positive, between the number of cycles consumed by other
instructions up to the next HI/LO access instruction and the latency of
MULT run in the background that matters.
From the context I am assuming none of this matters for the 74K (and
presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
isn't it something that should be decided based on instruction costs from
DFA schedulers? Is there anything that I've missed here? It doesn't
appear to me your (and neither the original) proposal takes instruction
cost calculation into consideration.
Maciej
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-24 15:49 ` Maciej W. Rozycki
@ 2012-09-24 21:40 ` Richard Sandiford
2012-09-25 0:52 ` Maciej W. Rozycki
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-24 21:40 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more? I am not sure I can
>> > guess how you'd like me to fix this patch now without some more specific
>> > review and/or suggestions about where the optimization should happen and
>> > what cases it should be extended to detect in addition to the dsp
>> > accumulator multiplies.
>>
>> The patch below is the one I've been testing. But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs. I think that was needed
>> for the madd-9.c test to pass.
>
> Sorry to come up with this so late -- I have only now noticed this being
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>> return simplify_gen_subreg (word_mode, op, mode, byte);
>> }
>>
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two. */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0. */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> + return (src == const0_rtx
>> + && REG_P (dest)
>> + && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> + && (ISA_HAS_DSP_MULT
>> + ? ACC_REG_P (REGNO (dest))
>> + : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two. */
>
> Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower
> than MTHI $0/MTLO $0? The latency of multiplication varies among
> implementations, for example the original R3000 took 12 cycles (of course
> the R3000 itself is not relevant for this change, but you see the
> picture!). On the other hand in some (but not all!) processors
> multiplication runs in parallel to the main pipeline so it is the
> difference, if positive, between the number of cycles consumed by other
> instructions up to the next HI/LO access instruction and the latency of
> MULT run in the background that matters.
>
> From the context I am assuming none of this matters for the 74K (and
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
> isn't it something that should be decided based on instruction costs from
> DFA schedulers? Is there anything that I've missed here? It doesn't
> appear to me your (and neither the original) proposal takes instruction
> cost calculation into consideration.
In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations. We deliberately don't use HI and LO as scratch space.
I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to MSUB.
I certainly don't know of any counter-examples. The decision is deliberately
centeralised in one place so that the condition can be tweaked in future
if necessary.
Richard
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-24 21:40 ` Richard Sandiford
@ 2012-09-25 0:52 ` Maciej W. Rozycki
2012-09-25 8:38 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-25 0:52 UTC (permalink / raw)
To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches
On Mon, 24 Sep 2012, Richard Sandiford wrote:
> > From the context I am assuming none of this matters for the 74K (and
> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
> > isn't it something that should be decided based on instruction costs from
> > DFA schedulers? Is there anything that I've missed here? It doesn't
> > appear to me your (and neither the original) proposal takes instruction
> > cost calculation into consideration.
>
> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
> operations. We deliberately don't use HI and LO as scratch space.
>
> I think it's a reasonable default assumption that anything that supports
> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
obviously the resulting larger code may adversely affect cache performance
in some scenarios.
> I certainly don't know of any counter-examples. The decision is deliberately
> centeralised in one place so that the condition can be tweaked in future
> if necessary.
Sure, the cost comparison could be done in that single place as well.
Maciej
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-25 0:52 ` Maciej W. Rozycki
@ 2012-09-25 8:38 ` Richard Sandiford
2012-09-25 10:29 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-25 8:38 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>
>> > From the context I am assuming none of this matters for the 74K (and
>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
>> > isn't it something that should be decided based on instruction costs from
>> > DFA schedulers? Is there anything that I've missed here? It doesn't
>> > appear to me your (and neither the original) proposal takes instruction
>> > cost calculation into consideration.
>>
>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>> operations. We deliberately don't use HI and LO as scratch space.
>>
>> I think it's a reasonable default assumption that anything that supports
>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>
> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
> cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
> obviously the resulting larger code may adversely affect cache performance
> in some scenarios.
That's not how the 4650 DFA models it though.
(define_insn_reservation "generic_hilo" 1
(eq_attr "type" "mfhi,mflo,mthi,mtlo")
"imuldiv*3")
(define_insn_reservation "r4650_imul" 4
(and (eq_attr "cpu" "r4650")
(eq_attr "type" "imul,imul3,imadd"))
"imuldiv*4")
So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
rather than 4 cycles. Any attempt to use the DFA would still favour MULT.
Richard
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-25 8:38 ` Richard Sandiford
@ 2012-09-25 10:29 ` Richard Sandiford
2012-09-25 18:06 ` Maciej W. Rozycki
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-09-25 10:29 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches
Richard Sandiford <rdsandiford@googlemail.com> writes:
> "Maciej W. Rozycki" <macro@codesourcery.com> writes:
>> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>>
>>> > From the context I am assuming none of this matters for the 74K (and
>>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
>>> > isn't it something that should be decided based on instruction costs from
>>> > DFA schedulers? Is there anything that I've missed here? It doesn't
>>> > appear to me your (and neither the original) proposal takes instruction
>>> > cost calculation into consideration.
>>>
>>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>>> operations. We deliberately don't use HI and LO as scratch space.
>>>
>>> I think it's a reasonable default assumption that anything that supports
>>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>>
>> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
>> cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
>> obviously the resulting larger code may adversely affect cache performance
>> in some scenarios.
>
> That's not how the 4650 DFA models it though.
>
> (define_insn_reservation "generic_hilo" 1
> (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> "imuldiv*3")
>
> (define_insn_reservation "r4650_imul" 4
> (and (eq_attr "cpu" "r4650")
> (eq_attr "type" "imul,imul3,imadd"))
> "imuldiv*4")
>
> So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> rather than 4 cycles. Any attempt to use the DFA would still favour MULT.
Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
MULT $0,$0 would be a really bad choice. Sigh. The amount of effort
required for this optimisation is getting a bit ridiculous.
Richard
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-25 10:29 ` Richard Sandiford
@ 2012-09-25 18:06 ` Maciej W. Rozycki
2012-10-07 8:45 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-09-25 18:06 UTC (permalink / raw)
To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches
On Tue, 25 Sep 2012, Richard Sandiford wrote:
> >> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
> >> cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
> >> obviously the resulting larger code may adversely affect cache performance
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> > (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> > "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> > (and (eq_attr "cpu" "r4650")
> > (eq_attr "type" "imul,imul3,imadd"))
> > "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles. Any attempt to use the DFA would still favour MULT.
I can't track a reference on R4650 MTHI/MTLO latency; I'd be happy to
learn of one, or otherwise I wonder where the delay is coming from. Also
a small update: apparently MULT is 3 clocks only on the R4650 where
operands are 16 bits (unsure if it is enough if only one is; for a zero by
zero multiplication it surely does not matter though). So I think using a
MULT here is at least reasonable.
> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice. Sigh. The amount of effort
> required for this optimisation is getting a bit ridiculous.
I have double-checked some documentation, and in fact many MIPS cores,
including the current ones, have a configuration option to include either
a high-performance or an area-efficient MD unit. Take the M14Kc for
example -- its high-performance unit has a one-cycle latency/issue rate
for 16-bit multiplication (two-cycle for full 32 bits; here the width of
rt is explicitly named) and the area-efficient has a 32-cycle
latency/issue rate only regardless of the operand size (obviously
iterating over addition one bit at a time). The latency of MTHI/MTLO is 1
across both units.
So I think this can't really be selected automatically for all cores,
some human-supplied knowledge about the MD unit used is required -- that
obviously affects other operations too, e.g. some multiplications
involving a constant that may be cheaper to do either directly or with a
sequence of additions depending on the MD unit present (unless optimising
for size, of course).
Maciej
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-09-25 18:06 ` Maciej W. Rozycki
@ 2012-10-07 8:45 ` Richard Sandiford
2012-10-08 23:23 ` Maciej W. Rozycki
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-10-07 8:45 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 25 Sep 2012, Richard Sandiford wrote:
>> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
>> MULT $0,$0 would be a really bad choice. Sigh. The amount of effort
>> required for this optimisation is getting a bit ridiculous.
>
> I have double-checked some documentation, and in fact many MIPS cores,
> including the current ones, have a configuration option to include either
> a high-performance or an area-efficient MD unit. Take the M14Kc for
> example -- its high-performance unit has a one-cycle latency/issue rate
> for 16-bit multiplication (two-cycle for full 32 bits; here the width of
> rt is explicitly named) and the area-efficient has a 32-cycle
> latency/issue rate only regardless of the operand size (obviously
> iterating over addition one bit at a time). The latency of MTHI/MTLO is 1
> across both units.
>
> So I think this can't really be selected automatically for all cores,
> some human-supplied knowledge about the MD unit used is required -- that
> obviously affects other operations too, e.g. some multiplications
> involving a constant that may be cheaper to do either directly or with a
> sequence of additions depending on the MD unit present (unless optimising
> for size, of course).
Yeah, as far as this optimisation goes, I think your original suggestion
of using the DFA to work out the best sequence is still right. If the DFA
says that multiplications take a handful of cycles when they actually take
32 then we're not going to optimise multiplication very well whatever happens.
In practice, code destined for non-4kc cores with iterative multipliers
would probably tune well with -mtune=4kp.
Anyway, I said upthread that I was looking into removing MD0_REG and
MD1_REG because I thought I'd seen a case where that was needed for
madd-9.c to pass. I can no longer reproduce that, so it was probably
pilot error.
The final patch ended up being much more complicated than I'd have liked,
but at least it should be easier to add other automatically-derived tuning
costs in future.
Tested on mipsisa64-elf and mipsisa32-elf. Applied.
Richard
gcc/
* config/mips/mips-protos.h (mips_split_type): New enum.
(mips_split_64bit_move_p, mips_split_doubleword_move): Delete.
(mips_split_move_p, mips_split_move, mips_split_move_insn_p)
(mips_split_move_insn): Declare.
* config/mips/mips.c (mips_tuning_info): New variable.
(mips_load_store_insns): Use mips_split_move_insn_p instead of
mips_split_64bit_move_p.
(mips_emit_move_or_split, mips_mult_move_p): New functions.
(mips_split_64bit_move_p): Rename to...
(mips_split_move_p): ...this and take a mips_split_type argument.
Generalize to all moves. Call mips_mult_move_p.
(mips_split_doubleword_move): Rename to...
(mips_split_move): ...this and take a mips_split_type argument.
Assert that mips_split_move_p holds.
(mips_insn_split_type, mips_split_move_insn_p, mips_split_move_insn):
New functions.
(mips_output_move): Use mips_split_move_p instead of
mips_split_64bit_move_p. Handle MULT $0, $0 moves.
(mips_save_reg): Use mips_emit_move_or_split.
(mips_sim_reset): Assign to curr_state. Call targetm.sched.init
and advance_state.
(mips_sim_init): Call targetm.sched.init_dfa_pre_cycle_insn and
targetm.sched.init_dfa_post_cycle_insn, if defined.
(mips_sim_next_cycle): Assign to curr_state. Use advance_state
instead of state_transition.
(mips_sim_issue_insn): Assign to curr_state. Use
targetm.sched.variable_issue to see how many more insns
can be issued.
(mips_seq_time, mips_mult_zero_zero_cost)
(mips_set_fast_mult_zero_zero_p, mips_set_tuning_info)
(mips_expand_to_rtl_hook): New functions.
(TARGET_EXPAND_TO_RTL_HOOK): Define.
* config/mips/mips.md (move_type): Add imul.
(type): Map imul move_types to imul.
(*movdi_32bit, *movti): Add imul alternatives.
Use mips_split_move_insn_p and mips_split_move_insn instead of
mips_split_64bit_move_p and mips_split_doubleword_move in move
splitters.
gcc/testsuite/
2012-10-07 Richard Sandiford <rdsandiford@googlemail.com>
Sandra Loosemore <sandra@codesourcery.com>
* gcc.target/mips/madd-9.c: Force code to be tuned for the 4kc
and test that the accumulator is initialized using MULT.
* gcc.target/mips/mips32-dsp-accinit-1.c: New test.
* gcc.target/mips/mips32-dsp-accinit-2.c: Likewise.
Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h 2012-10-06 17:26:43.408618293 +0100
+++ gcc/config/mips/mips-protos.h 2012-10-07 08:21:40.892791332 +0100
@@ -173,6 +173,25 @@ enum mips_call_type {
MIPS_CALL_EPILOGUE
};
+/* Controls the conditions under which certain instructions are split.
+
+ SPLIT_IF_NECESSARY
+ Only perform splits that are necessary for correctness
+ (because no unsplit version exists).
+
+ SPLIT_FOR_SPEED
+ Perform splits that are necessary for correctness or
+ beneficial for code speed.
+
+ SPLIT_FOR_SIZE
+ Perform splits that are necessary for correctness or
+ beneficial for code size. */
+enum mips_split_type {
+ SPLIT_IF_NECESSARY,
+ SPLIT_FOR_SPEED,
+ SPLIT_FOR_SIZE
+};
+
extern bool mips_symbolic_constant_p (rtx, enum mips_symbol_context,
enum mips_symbol_type *);
extern int mips_regno_mode_ok_for_base_p (int, enum machine_mode, bool);
@@ -212,8 +231,10 @@ extern int m16_simm8_8 (rtx, enum machin
extern int m16_nsimm8_8 (rtx, enum machine_mode);
extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx, enum mips_split_type);
+extern void mips_split_move (rtx, rtx, enum mips_split_type);
+extern bool mips_split_move_insn_p (rtx, rtx, rtx);
+extern void mips_split_move_insn (rtx, rtx, rtx);
extern const char *mips_output_move (rtx, rtx);
extern bool mips_cfun_has_cprestore_slot_p (void);
extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c 2012-10-06 22:24:11.262714074 +0100
+++ gcc/config/mips/mips.c 2012-10-07 09:38:39.656797917 +0100
@@ -265,6 +265,24 @@ static const char *const mips_fp_conditi
MIPS_FP_CONDITIONS (STRINGIFY)
};
+/* Tuning information that is automatically derived from other sources
+ (such as the scheduler). */
+static struct {
+ /* The architecture and tuning settings that this structure describes. */
+ enum processor arch;
+ enum processor tune;
+
+ /* True if this structure describes MIPS16 settings. */
+ bool mips16_p;
+
+ /* True if the structure has been initialized. */
+ bool initialized_p;
+
+ /* True if "MULT $0, $0" is preferable to "MTLO $0; MTHI $0"
+ when optimizing for speed. */
+ bool fast_mult_zero_zero_p;
+} mips_tuning_info;
+
/* Information about a function's frame layout. */
struct GTY(()) mips_frame_info {
/* The size of the frame in bytes. */
@@ -2395,11 +2413,11 @@ mips_load_store_insns (rtx mem, rtx insn
mode = GET_MODE (mem);
/* Try to prove that INSN does not need to be split. */
- might_split_p = true;
- if (GET_MODE_BITSIZE (mode) == 64)
+ might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+ if (might_split_p)
{
set = single_set (insn);
- if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+ if (set && !mips_split_move_insn_p (SET_DEST (set), SET_SRC (set), insn))
might_split_p = false;
}
@@ -2441,6 +2459,18 @@ mips_emit_move (rtx dest, rtx src)
: emit_move_insn_1 (dest, src));
}
+/* Emit a move from SRC to DEST, splitting compound moves into individual
+ instructions. SPLIT_TYPE is the type of split to perform. */
+
+static void
+mips_emit_move_or_split (rtx dest, rtx src, enum mips_split_type split_type)
+{
+ if (mips_split_move_p (dest, src, split_type))
+ mips_split_move (dest, src, split_type);
+ else
+ mips_emit_move (dest, src);
+}
+
/* Emit an instruction of the form (set TARGET (CODE OP0)). */
static void
@@ -4119,39 +4149,60 @@ mips_subword (rtx op, bool high_p)
return simplify_gen_subreg (word_mode, op, mode, byte);
}
-/* Return true if a 64-bit move from SRC to DEST should be split into two. */
+/* Return true if SRC should be moved into DEST using "MULT $0, $0".
+ SPLIT_TYPE is the condition under which moves should be split. */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src, enum mips_split_type split_type)
+{
+ return ((split_type != SPLIT_FOR_SPEED
+ || mips_tuning_info.fast_mult_zero_zero_p)
+ && src == const0_rtx
+ && REG_P (dest)
+ && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+ && (ISA_HAS_DSP_MULT
+ ? ACC_REG_P (REGNO (dest))
+ : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two.
+ SPLIT_TYPE describes the split condition. */
bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src, enum mips_split_type split_type)
{
- if (TARGET_64BIT)
+ /* Check whether the move can be done using some variant of MULT $0,$0. */
+ if (mips_mult_move_p (dest, src, split_type))
return false;
/* FPR-to-FPR moves can be done in a single instruction, if they're
allowed at all. */
- if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+ unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+ if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
return false;
/* Check for floating-point loads and stores. */
- if (ISA_HAS_LDC1_SDC1)
+ if (size == 8 && ISA_HAS_LDC1_SDC1)
{
if (FP_REG_RTX_P (dest) && MEM_P (src))
return false;
if (FP_REG_RTX_P (src) && MEM_P (dest))
return false;
}
- return true;
+
+ /* Otherwise split all multiword moves. */
+ return size > UNITS_PER_WORD;
}
-/* Split a doubleword move from SRC to DEST. On 32-bit targets,
- this function handles 64-bit moves for which mips_split_64bit_move_p
- holds. For 64-bit targets, this function handles 128-bit moves. */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds.
+ SPLIT_TYPE describes the split condition. */
void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src, enum mips_split_type split_type)
{
rtx low_dest;
+ gcc_checking_assert (mips_split_move_p (dest, src, split_type));
if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
{
if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4206,6 +4257,41 @@ mips_split_doubleword_move (rtx dest, rt
}
}
}
+
+/* Return the split type for instruction INSN. */
+
+static enum mips_split_type
+mips_insn_split_type (rtx insn)
+{
+ basic_block bb = BLOCK_FOR_INSN (insn);
+ if (bb)
+ {
+ if (optimize_bb_for_speed_p (bb))
+ return SPLIT_FOR_SPEED;
+ else
+ return SPLIT_FOR_SIZE;
+ }
+ /* Once CFG information has been removed, we should trust the optimization
+ decisions made by previous passes and only split where necessary. */
+ return SPLIT_IF_NECESSARY;
+}
+
+/* Return true if a move from SRC to DEST in INSN should be split. */
+
+bool
+mips_split_move_insn_p (rtx dest, rtx src, rtx insn)
+{
+ return mips_split_move_p (dest, src, mips_insn_split_type (insn));
+}
+
+/* Split a move from SRC to DEST in INSN, given that mips_split_move_insn_p
+ holds. */
+
+void
+mips_split_move_insn (rtx dest, rtx src, rtx insn)
+{
+ mips_split_move (dest, src, mips_insn_split_type (insn));
+}
\f
/* Return the appropriate instructions to move SRC into DEST. Assume
that SRC is operand 1 and DEST is operand 0. */
@@ -4223,7 +4309,7 @@ mips_output_move (rtx dest, rtx src)
mode = GET_MODE (dest);
dbl_p = (GET_MODE_SIZE (mode) == 8);
- if (dbl_p && mips_split_64bit_move_p (dest, src))
+ if (mips_split_move_p (dest, src, SPLIT_IF_NECESSARY))
return "#";
if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4234,6 +4320,14 @@ mips_output_move (rtx dest, rtx src)
if (GP_REG_P (REGNO (dest)))
return "move\t%0,%z1";
+ if (mips_mult_move_p (dest, src, SPLIT_IF_NECESSARY))
+ {
+ if (ISA_HAS_DSP_MULT)
+ return "mult\t%q0,%.,%.";
+ else
+ return "mult\t%.,%.";
+ }
+
/* Moves to HI are handled by special .md insns. */
if (REGNO (dest) == LO_REGNUM)
return "mtlo\t%z1";
@@ -10444,10 +10538,7 @@ mips_save_reg (rtx reg, rtx mem)
{
rtx x1, x2;
- if (mips_split_64bit_move_p (mem, reg))
- mips_split_doubleword_move (mem, reg);
- else
- mips_emit_move (mem, reg);
+ mips_emit_move_or_split (mem, reg, SPLIT_IF_NECESSARY);
x1 = mips_frame_set (mips_subword (mem, false),
mips_subword (reg, false));
@@ -14907,10 +14998,15 @@ struct mips_sim {
static void
mips_sim_reset (struct mips_sim *state)
{
+ curr_state = state->dfa_state;
+
state->time = 0;
state->insns_left = state->issue_rate;
memset (&state->last_set, 0, sizeof (state->last_set));
- state_reset (state->dfa_state);
+ state_reset (curr_state);
+
+ targetm.sched.init (0, false, 0);
+ advance_state (curr_state);
}
/* Initialize STATE before its first use. DFA_STATE points to an
@@ -14919,6 +15015,12 @@ mips_sim_reset (struct mips_sim *state)
static void
mips_sim_init (struct mips_sim *state, state_t dfa_state)
{
+ if (targetm.sched.init_dfa_pre_cycle_insn)
+ targetm.sched.init_dfa_pre_cycle_insn ();
+
+ if (targetm.sched.init_dfa_post_cycle_insn)
+ targetm.sched.init_dfa_post_cycle_insn ();
+
state->issue_rate = mips_issue_rate ();
state->dfa_state = dfa_state;
mips_sim_reset (state);
@@ -14929,9 +15031,11 @@ mips_sim_init (struct mips_sim *state, s
static void
mips_sim_next_cycle (struct mips_sim *state)
{
+ curr_state = state->dfa_state;
+
state->time++;
state->insns_left = state->issue_rate;
- state_transition (state->dfa_state, 0);
+ advance_state (curr_state);
}
/* Advance simulation state STATE until instruction INSN can read
@@ -15037,8 +15141,11 @@ mips_sim_record_set (rtx x, const_rtx pa
static void
mips_sim_issue_insn (struct mips_sim *state, rtx insn)
{
- state_transition (state->dfa_state, insn);
- state->insns_left--;
+ curr_state = state->dfa_state;
+
+ state_transition (curr_state, insn);
+ state->insns_left = targetm.sched.variable_issue (0, false, insn,
+ state->insns_left);
mips_sim_insn = insn;
note_stores (PATTERN (insn), mips_sim_record_set, state);
@@ -15089,6 +15196,109 @@ mips_sim_finish_insn (struct mips_sim *s
break;
}
}
+
+/* Use simulator state STATE to calculate the execution time of
+ instruction sequence SEQ. */
+
+static unsigned int
+mips_seq_time (struct mips_sim *state, rtx seq)
+{
+ mips_sim_reset (state);
+ for (rtx insn = seq; insn; insn = NEXT_INSN (insn))
+ {
+ mips_sim_wait_insn (state, insn);
+ mips_sim_issue_insn (state, insn);
+ }
+ return state->time;
+}
+\f
+/* Return the execution-time cost of mips_tuning_info.fast_mult_zero_zero_p
+ setting SETTING, using STATE to simulate instruction sequences. */
+
+static unsigned int
+mips_mult_zero_zero_cost (struct mips_sim *state, bool setting)
+{
+ mips_tuning_info.fast_mult_zero_zero_p = setting;
+ start_sequence ();
+
+ enum machine_mode dword_mode = TARGET_64BIT ? TImode : DImode;
+ rtx hilo = gen_rtx_REG (dword_mode, MD_REG_FIRST);
+ mips_emit_move_or_split (hilo, const0_rtx, SPLIT_FOR_SPEED);
+
+ /* If the target provides mulsidi3_32bit then that's the most likely
+ consumer of the result. Test for bypasses. */
+ if (dword_mode == DImode && HAVE_maddsidi4)
+ {
+ rtx gpr = gen_rtx_REG (SImode, GP_REG_FIRST + 4);
+ emit_insn (gen_maddsidi4 (hilo, gpr, gpr, hilo));
+ }
+
+ unsigned int time = mips_seq_time (state, get_insns ());
+ end_sequence ();
+ return time;
+}
+
+/* Check the relative speeds of "MULT $0,$0" and "MTLO $0; MTHI $0"
+ and set up mips_tuning_info.fast_mult_zero_zero_p accordingly.
+ Prefer MULT -- which is shorter -- in the event of a tie. */
+
+static void
+mips_set_fast_mult_zero_zero_p (struct mips_sim *state)
+{
+ if (TARGET_MIPS16)
+ /* No MTLO or MTHI available. */
+ mips_tuning_info.fast_mult_zero_zero_p = true;
+ else
+ {
+ unsigned int true_time = mips_mult_zero_zero_cost (state, true);
+ unsigned int false_time = mips_mult_zero_zero_cost (state, false);
+ mips_tuning_info.fast_mult_zero_zero_p = (true_time <= false_time);
+ }
+}
+
+/* Set up costs based on the current architecture and tuning settings. */
+
+static void
+mips_set_tuning_info (void)
+{
+ if (mips_tuning_info.initialized_p
+ && mips_tuning_info.arch == mips_arch
+ && mips_tuning_info.tune == mips_tune
+ && mips_tuning_info.mips16_p == TARGET_MIPS16)
+ return;
+
+ mips_tuning_info.arch = mips_arch;
+ mips_tuning_info.tune = mips_tune;
+ mips_tuning_info.mips16_p = TARGET_MIPS16;
+ mips_tuning_info.initialized_p = true;
+
+ dfa_start ();
+
+ struct mips_sim state;
+ mips_sim_init (&state, alloca (state_size ()));
+
+ mips_set_fast_mult_zero_zero_p (&state);
+
+ dfa_finish ();
+}
+
+/* Implement TARGET_EXPAND_TO_RTL_HOOK. */
+
+static void
+mips_expand_to_rtl_hook (void)
+{
+ /* We need to call this at a point where we can safely create sequences
+ of instructions, so TARGET_OVERRIDE_OPTIONS is too early. We also
+ need to call it at a point where the DFA infrastructure is not
+ already in use, so we can't just call it lazily on demand.
+
+ At present, mips_tuning_info is only needed during post-expand
+ RTL passes such as split_insns, so this hook should be early enough.
+ We may need to move the call elsewhere if mips_tuning_info starts
+ to be used for other things (such as rtx_costs, or expanders that
+ could be called during gimple optimization). */
+ mips_set_tuning_info ();
+}
\f
/* The VR4130 pipeline issues aligned pairs of instructions together,
but it stalls the second instruction if it depends on the first.
@@ -17760,6 +17970,8 @@ #define TARGET_MACHINE_DEPENDENT_REORG m
#undef TARGET_PREFERRED_RELOAD_CLASS
#define TARGET_PREFERRED_RELOAD_CLASS mips_preferred_reload_class
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK mips_expand_to_rtl_hook
#undef TARGET_ASM_FILE_START
#define TARGET_ASM_FILE_START mips_file_start
#undef TARGET_ASM_FILE_START_FILE_DIRECTIVE
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md 2012-10-06 21:08:44.902431555 +0100
+++ gcc/config/mips/mips.md 2012-10-06 22:24:24.854713317 +0100
@@ -204,7 +204,7 @@ (define_attr "jal_macro" "no,yes"
;; the split instructions; in some cases, it is more appropriate for the
;; scheduling type to be "multi" instead.
(define_attr "move_type"
- "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+ "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
shift_shift"
(const_string "unknown"))
@@ -369,6 +369,7 @@ (define_attr "type"
(eq_attr "move_type" "mflo") (const_string "mflo")
;; These types of move are always single insns.
+ (eq_attr "move_type" "imul") (const_string "imul")
(eq_attr "move_type" "fmove") (const_string "fmove")
(eq_attr "move_type" "loadpool") (const_string "load")
(eq_attr "move_type" "signext") (const_string "signext")
@@ -4242,14 +4243,17 @@ (define_insn "*mov<mode>_ra"
(set_attr "mode" "<MODE>")])
(define_insn "*movdi_32bit"
- [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
- (match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+ [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+ (match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
"!TARGET_64BIT && !TARGET_MIPS16
&& (register_operand (operands[0], DImode)
|| reg_or_0_operand (operands[1], DImode))"
{ return mips_output_move (operands[0], operands[1]); }
- [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
- (set_attr "mode" "DI")])
+ [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+ (set (attr "mode")
+ (if_then_else (eq_attr "move_type" "imul")
+ (const_string "SI")
+ (const_string "DI")))])
(define_insn "*movdi_32bit_mips16"
[(set (match_operand:DI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4695,15 +4699,18 @@ (define_expand "movti"
})
(define_insn "*movti"
- [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
- (match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+ [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+ (match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
"TARGET_64BIT
&& !TARGET_MIPS16
&& (register_operand (operands[0], TImode)
|| reg_or_0_operand (operands[1], TImode))"
- "#"
- [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
- (set_attr "mode" "TI")])
+ { return mips_output_move (operands[0], operands[1]); }
+ [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
+ (set (attr "mode")
+ (if_then_else (eq_attr "move_type" "imul")
+ (const_string "SI")
+ (const_string "TI")))])
(define_insn "*movti_mips16"
[(set (match_operand:TI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4753,21 +4760,20 @@ (define_insn "*movtf_mips16"
(define_split
[(set (match_operand:MOVE64 0 "nonimmediate_operand")
(match_operand:MOVE64 1 "move_operand"))]
- "reload_completed && !TARGET_64BIT
- && mips_split_64bit_move_p (operands[0], operands[1])"
+ "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
[(const_int 0)]
{
- mips_split_doubleword_move (operands[0], operands[1]);
+ mips_split_move_insn (operands[0], operands[1], curr_insn);
DONE;
})
(define_split
[(set (match_operand:MOVE128 0 "nonimmediate_operand")
(match_operand:MOVE128 1 "move_operand"))]
- "TARGET_64BIT && reload_completed"
+ "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
[(const_int 0)]
{
- mips_split_doubleword_move (operands[0], operands[1]);
+ mips_split_move_insn (operands[0], operands[1], curr_insn);
DONE;
})
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c 2012-10-06 17:26:43.407618293 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c 2012-10-06 22:24:24.855713317 +0100
@@ -1,7 +1,13 @@
/* { dg-do compile } */
-/* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* { dg-options "isa_rev>=1 -mgp32 -mtune=4kc" } */
+/* References to X within the loop need to have a higher frequency than
+ references to X outside the loop, otherwise there is no reason
+ to prefer multiply/accumulator registers over GPRs. */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
/* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
/* { dg-final { scan-assembler "\tmadd\t" } } */
NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c
===================================================================
--- /dev/null 2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c 2012-10-06 22:24:24.856713316 +0100
@@ -0,0 +1,22 @@
+/* { dg-options "-mdspr2 -mgp32 -mtune=74kc" } */
+/* References to RESULT within the loop need to have a higher frequency than
+ references to RESULT outside the loop, otherwise there is no reason
+ to prefer multiply/accumulator registers over GPRs. */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+ the madd is done by means of a mult instruction instead of mthi/mtlo. */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+ long long result = 0;
+ int i;
+
+ for (i = 0; i < n; i++)
+ result = __builtin_mips_madd (result, v[i], m);
+ return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */
+/* { dg-final { scan-assembler-not "mthi\t" } } */
+/* { dg-final { scan-assembler-not "mtlo\t" } } */
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- /dev/null 2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c 2012-10-06 22:24:24.857713316 +0100
@@ -0,0 +1,22 @@
+/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
+/* References to RESULT within the loop need to have a higher frequency than
+ references to RESULT outside the loop, otherwise there is no reason
+ to prefer multiply/accumulator registers over GPRs. */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+ the madd is done by means of a mult instruction instead of mthi/mtlo. */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+ long long result = 0;
+ int i;
+
+ for (i = 0; i < n; i++)
+ result = __builtin_mips_madd (result, v[i], m);
+ return result;
+}
+
+/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
+/* { dg-final { scan-assembler "\tmthi\t" } } */
+/* { dg-final { scan-assembler "\tmtlo\t" } } */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-10-07 8:45 ` Richard Sandiford
@ 2012-10-08 23:23 ` Maciej W. Rozycki
2012-10-10 20:00 ` Richard Sandiford
0 siblings, 1 reply; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-10-08 23:23 UTC (permalink / raw)
To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches
On Sun, 7 Oct 2012, Richard Sandiford wrote:
> > So I think this can't really be selected automatically for all cores,
> > some human-supplied knowledge about the MD unit used is required -- that
> > obviously affects other operations too, e.g. some multiplications
> > involving a constant that may be cheaper to do either directly or with a
> > sequence of additions depending on the MD unit present (unless optimising
> > for size, of course).
>
> Yeah, as far as this optimisation goes, I think your original suggestion
> of using the DFA to work out the best sequence is still right. If the DFA
> says that multiplications take a handful of cycles when they actually take
> 32 then we're not going to optimise multiplication very well whatever happens.
>
> In practice, code destined for non-4kc cores with iterative multipliers
> would probably tune well with -mtune=4kp.
Hmm, maybe, although I find requiring people to tune for an obsolete
MIPS32 rev. 1 processor to get the right MDU performance figures for
modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations
for newer instructions introduced with the MIPS32r2 architecture update).
I've thought of -miterative-mdu or suchlike a knob that would override
the default cost of multiplication/division as appropriate (i.e. to 32/64
plus any extra operation-specific constant as required), perhaps by
forcing the 4Kp insn reservation (extended to 64 bits as required) over
the usual one; of course there would be nothing to override the -mtune=4Kp
arrangement with. Nothing urgent though, just an idea to ponder.
> The final patch ended up being much more complicated than I'd have liked,
> but at least it should be easier to add other automatically-derived tuning
> costs in future.
Thanks for taking my concerns into account. One nit below.
> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
> ===================================================================
> --- /dev/null 2012-10-06 09:26:21.400998949 +0100
> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c 2012-10-06 22:24:24.857713316 +0100
> @@ -0,0 +1,22 @@
> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
> +/* References to RESULT within the loop need to have a higher frequency than
> + references to RESULT outside the loop, otherwise there is no reason
> + to prefer multiply/accumulator registers over GPRs. */
> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
> +
> +/* Check that the zero-initialization of the accumulator feeding into
> + the madd is done by means of a mult instruction instead of mthi/mtlo. */
This comment looks reversed to me as in this case we do NOT want a MULT
to be used...
> +
> +NOMIPS16 long long f (int n, int *v, int m)
> +{
> + long long result = 0;
> + int i;
> +
> + for (i = 0; i < n; i++)
> + result = __builtin_mips_madd (result, v[i], m);
> + return result;
> +}
> +
> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
> +/* { dg-final { scan-assembler "\tmthi\t" } } */
> +/* { dg-final { scan-assembler "\tmtlo\t" } } */
... and in fact you do check that we didn't get one.
Maciej
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-10-08 23:23 ` Maciej W. Rozycki
@ 2012-10-10 20:00 ` Richard Sandiford
2012-10-11 15:29 ` Maciej W. Rozycki
0 siblings, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2012-10-10 20:00 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Sandra Loosemore, gcc-patches
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Sun, 7 Oct 2012, Richard Sandiford wrote:
>
>> > So I think this can't really be selected automatically for all cores,
>> > some human-supplied knowledge about the MD unit used is required -- that
>> > obviously affects other operations too, e.g. some multiplications
>> > involving a constant that may be cheaper to do either directly or with a
>> > sequence of additions depending on the MD unit present (unless optimising
>> > for size, of course).
>>
>> Yeah, as far as this optimisation goes, I think your original suggestion
>> of using the DFA to work out the best sequence is still right. If the DFA
>> says that multiplications take a handful of cycles when they actually take
>> 32 then we're not going to optimise multiplication very well whatever happens.
>>
>> In practice, code destined for non-4kc cores with iterative multipliers
>> would probably tune well with -mtune=4kp.
>
> Hmm, maybe, although I find requiring people to tune for an obsolete
> MIPS32 rev. 1 processor to get the right MDU performance figures for
> modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations
> for newer instructions introduced with the MIPS32r2 architecture update).
>
> I've thought of -miterative-mdu or suchlike a knob that would override
> the default cost of multiplication/division as appropriate (i.e. to 32/64
> plus any extra operation-specific constant as required), perhaps by
> forcing the 4Kp insn reservation (extended to 64 bits as required) over
> the usual one; of course there would be nothing to override the -mtune=4Kp
> arrangement with. Nothing urgent though, just an idea to ponder.
FWIW, using the 4Kp reservation itself wouldn't necessarily work,
because all modern scheduling descriptions define their own CPU units.
You'd probably need to add new reservations to each affected MIPS32
and MIPS64 .md file.
>> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
>> ===================================================================
>> --- /dev/null 2012-10-06 09:26:21.400998949 +0100
>> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c 2012-10-06 22:24:24.857713316 +0100
>> @@ -0,0 +1,22 @@
>> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
>> +/* References to RESULT within the loop need to have a higher frequency than
>> + references to RESULT outside the loop, otherwise there is no reason
>> + to prefer multiply/accumulator registers over GPRs. */
>> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
>> +
>> +/* Check that the zero-initialization of the accumulator feeding into
>> + the madd is done by means of a mult instruction instead of mthi/mtlo. */
>
> This comment looks reversed to me as in this case we do NOT want a MULT
> to be used...
>
>> +
>> +NOMIPS16 long long f (int n, int *v, int m)
>> +{
>> + long long result = 0;
>> + int i;
>> +
>> + for (i = 0; i < n; i++)
>> + result = __builtin_mips_madd (result, v[i], m);
>> + return result;
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
>> +/* { dg-final { scan-assembler "\tmthi\t" } } */
>> +/* { dg-final { scan-assembler "\tmtlo\t" } } */
>
> ... and in fact you do check that we didn't get one.
Thanks, fixed as follows.
Richard
gcc/testsuite/
* gcc.target/mips/mips32-dsp-accinit-2.c: Fix test description.
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c 2012-10-07 09:45:04.000000000 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c 2012-10-10 20:51:44.402528138 +0100
@@ -5,7 +5,8 @@
/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
/* Check that the zero-initialization of the accumulator feeding into
- the madd is done by means of a mult instruction instead of mthi/mtlo. */
+ the madd is done by means of an mthi & mtlo pair instead of a
+ "mult $0,$0" instruction. */
NOMIPS16 long long f (int n, int *v, int m)
{
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
2012-10-10 20:00 ` Richard Sandiford
@ 2012-10-11 15:29 ` Maciej W. Rozycki
0 siblings, 0 replies; 16+ messages in thread
From: Maciej W. Rozycki @ 2012-10-11 15:29 UTC (permalink / raw)
To: Richard Sandiford; +Cc: Sandra Loosemore, gcc-patches
On Wed, 10 Oct 2012, Richard Sandiford wrote:
> > I've thought of -miterative-mdu or suchlike a knob that would override
> > the default cost of multiplication/division as appropriate (i.e. to 32/64
> > plus any extra operation-specific constant as required), perhaps by
> > forcing the 4Kp insn reservation (extended to 64 bits as required) over
> > the usual one; of course there would be nothing to override the -mtune=4Kp
> > arrangement with. Nothing urgent though, just an idea to ponder.
>
> FWIW, using the 4Kp reservation itself wouldn't necessarily work,
> because all modern scheduling descriptions define their own CPU units.
> You'd probably need to add new reservations to each affected MIPS32
> and MIPS64 .md file.
Thanks for the hint. I have skimmed through documentation I have
available and the following MTI processors implement an iterative MDU as
an RTL option: M4K, M14K, M14Kc, and the following ones always have one:
4Kp, 4KEp. No 64-bit MTI processor had it so far it would seem (the 5K
chips have an iterative divider and a high-performance multiplier).
We currently have no pipeline descriptions for any of the M cores, so my
concern seemed to be a little bit premature (though still valid of
course). The option proposed will only be needed once/if we get any of
the respective pipeline descriptions. Arguably anyone targetting them is
likely going to be after -Os anyway, although at least with the M14Kc I
think it should not be taken for granted.
We're missing a pipeline description for the 4KE-family cores as well.
If not adding costs for the extra MIPS32r2 additions I think at the very
least we should trivially enable the existing bits of 4k.md for these
cores; the pipeline is the same across the whole 4K/4KE family and it's
only the extra instructions that have been added with the E update.
Maciej
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2012-10-11 15:26 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-16 15:13 [PATCH, MIPS] add new peephole for 74k dspr2 Sandra Loosemore
2012-08-19 17:23 ` Richard Sandiford
2012-08-22 14:26 ` Sandra Loosemore
2012-08-27 16:37 ` Richard Sandiford
2012-09-18 17:18 ` PING " Sandra Loosemore
2012-09-18 18:17 ` Richard Sandiford
2012-09-24 15:49 ` Maciej W. Rozycki
2012-09-24 21:40 ` Richard Sandiford
2012-09-25 0:52 ` Maciej W. Rozycki
2012-09-25 8:38 ` Richard Sandiford
2012-09-25 10:29 ` Richard Sandiford
2012-09-25 18:06 ` Maciej W. Rozycki
2012-10-07 8:45 ` Richard Sandiford
2012-10-08 23:23 ` Maciej W. Rozycki
2012-10-10 20:00 ` Richard Sandiford
2012-10-11 15:29 ` Maciej W. Rozycki
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).