* [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs
@ 2014-11-19 13:30 Maciej W. Rozycki
2014-11-21 19:14 ` Moore, Catherine
2014-12-30 0:25 ` [PATCH] MIPS16/GCC: Optimise `__call_stub_' call stubs Maciej W. Rozycki
0 siblings, 2 replies; 4+ messages in thread
From: Maciej W. Rozycki @ 2014-11-19 13:30 UTC (permalink / raw)
To: gcc-patches; +Cc: Catherine Moore, Eric Christopher, Matthew Fortune
Hi,
It has come to my attention that we create suboptimal code for the
`__call_stub_fp_' variant of the MIPS16 call stubs. These stubs are
used for outgoing calls made from MIPS16 code to standard MIPS code that
return floating-point results and may also pass floating-point
arguments.
This can be illustrated with this small program:
$ cat foo.c
double bar (double d);
int main (int argc, char **argv)
{
return bar (argc);
}
$ mips-linux-gnu-gcc -O2 -mips16 -c foo.c
$ mips-linux-gnu-objdump -dr foo.o
foo.o: file format elf32-tradbigmips
Disassembly of section .mips16.call.fp.bar:
00000000 <__call_stub_fp_bar>:
0: 44856000 mtc1 a1,$f12
4: 44846800 mtc1 a0,$f13
8: 03e09021 move s2,ra
c: 0c000000 jal 0 <__call_stub_fp_bar>
c: R_MIPS_26 bar
10: 00000000 nop
14: 44030000 mfc1 v1,$f0
18: 02400008 jr s2
1c: 44020800 mfc1 v0,$f1
Disassembly of section .text.startup:
00000000 <main>:
0: f100 64c4 save 32,ra,s2
4: 1800 0000 jal 0 <main>
4: R_MIPS16_26 __mips16_floatsidf
8: 6500 nop
a: 67a3 move a1,v1
c: 1800 0000 jal 0 <main>
c: R_MIPS16_26 bar
10: 6782 move a0,v0
12: 67a3 move a1,v1
14: 1800 0000 jal 0 <main>
14: R_MIPS16_26 __mips16_fix_truncdfsi
18: 6782 move a0,v0
1a: f100 6444 restore 32,ra,s2
1e: e8a0 jrc ra
-- as you can see in `__call_stub_fp_bar' above the jump delay slot
remains empty because the move to $s2 instruction cannot be scheduled
there due to a data dependency on $ra.
However there is no need to save $ra last and since the MIPS IV ISA
there are no coprocessor move delay slots so the last MTC1 instruction
could be scheduled there instead.
So here is a change to avoid this empty delay slot. With this change
in place we get this code instead:
$ mips-linux-gnu-objdump -dr foo.o
foo.o: file format elf32-tradbigmips
Disassembly of section .mips16.call.fp.bar:
00000000 <__call_stub_fp_bar>:
0: 03e09021 move s2,ra
4: 44856000 mtc1 a1,$f12
8: 0c000000 jal 0 <__call_stub_fp_bar>
8: R_MIPS_26 bar
c: 44846800 mtc1 a0,$f13
10: 44030000 mfc1 v1,$f0
14: 02400008 jr s2
18: 44020800 mfc1 v0,$f1
Disassembly of section .text.startup:
00000000 <main>:
0: f100 64c4 save 32,ra,s2
4: 1800 0000 jal 0 <main>
4: R_MIPS16_26 __mips16_floatsidf
8: 6500 nop
a: 67a3 move a1,v1
c: 1800 0000 jal 0 <main>
c: R_MIPS16_26 bar
10: 6782 move a0,v0
12: 67a3 move a1,v1
14: 1800 0000 jal 0 <main>
14: R_MIPS16_26 __mips16_fix_truncdfsi
18: 6782 move a0,v0
1a: f100 6444 restore 32,ra,s2
1e: e8a0 jrc ra
-- as you can see the last MTC1 instruction has now been placed in the
delay slot.
For ISAs that do have a coprocessor move delay slot there is no gain,
but no loss either, both delay slots remain empty due to data
dependencies (coprocessor move delay slots):
$ mips-linux-gnu-gcc -O2 -mips1 -mips16 -c foo.c
$ mips-linux-gnu-objdump -dr foo.o
foo.o: file format elf32-tradbigmips
Disassembly of section .mips16.call.fp.bar:
00000000 <__call_stub_fp_bar>:
0: 03e09021 move s2,ra
4: 44856000 mtc1 a1,$f12
8: 44846800 mtc1 a0,$f13
c: 0c000000 jal 0 <__call_stub_fp_bar>
c: R_MIPS_26 bar
10: 00000000 nop
14: 44030000 mfc1 v1,$f0
18: 44020800 mfc1 v0,$f1
1c: 02400008 jr s2
20: 00000000 nop
Disassembly of section .text.startup:
00000000 <main>:
0: 63fc addiu sp,-32
2: 6772 move v1,s2
4: 6207 sw ra,28(sp)
6: 1800 0000 jal 0 <main>
6: R_MIPS16_26 __mips16_floatsidf
a: d306 sw v1,24(sp)
c: 67a3 move a1,v1
e: 1800 0000 jal 0 <main>
e: R_MIPS16_26 bar
12: 6782 move a0,v0
14: 67a3 move a1,v1
16: 1800 0000 jal 0 <main>
16: R_MIPS16_26 __mips16_fix_truncdfsi
1a: 6782 move a0,v0
1c: 9606 lw a2,24(sp)
1e: 9707 lw a3,28(sp)
20: 6556 move s2,a2
22: ef00 jr a3
24: 6304 addiu sp,32
26: 6500 nop
-- though I think this consideration is actually academic as no legacy
MIPS16 processors have ever been made that had an FPU I believe (delay
slots do not matter with software FPU emulation). The original MIPS16
implementation, the LSI's TinyRISC cores such as the TR4101 CPU did
provide a COPx interface including possibly an FPU, but to the best of
my knowledge none was actually made. See also the discussion around the
previous iteration of these optimisations here:
http://gcc.gnu.org/ml/gcc-patches/2013-07/msg01414.html
This change has been regression-tested with the mips-linux-gnu target
and these multilibs:
-EB
-EB -mips16
-EB -mmicromips
-EB -mabi=n32
-EB -mabi=64
and the -EL variants of same, with no changes in results.
FWIW there is no need to make a corresponding change to GDB to
interpret the modified stub as the debugger ignores and skips over the
move to $s2 in interpreting stub's code, the instruction can be
anywhere.
I have a second optimisation to make here too, but that triggers a
surprising bug in GNU LD where BFD code meant to discard unused stubs
appears not to work at all. So that'll have to be fixed first and it
also means the other optimisation is unsafe to include in 5.0. I plan
to post it shortly anyway for discussion, once I have the linker bug
fixed.
Meanwhile, OK to apply this change?
2014-11-19 Maciej W. Rozycki <macro@codesourcery.com>
gcc/
* config/mips/mips.c (mips16_build_call_stub): Move the save of
the return address in $18 ahead of passing arguments to FPRs.
Maciej
gcc-mips16-call-fp-stub-s2.patch
Index: gcc-fsf-trunk-quilt/gcc/config/mips/mips.c
===================================================================
--- gcc-fsf-trunk-quilt.orig/gcc/config/mips/mips.c 2014-11-18 13:28:13.508975765 +0000
+++ gcc-fsf-trunk-quilt/gcc/config/mips/mips.c 2014-11-18 23:33:01.418180571 +0000
@@ -6945,6 +6945,17 @@ mips16_build_call_stub (rtx retval, rtx
/* "Save" $sp in itself so we don't use the fake CFA.
This is: DW_CFA_val_expression r29, { DW_OP_reg29 }. */
fprintf (asm_out_file, "\t.cfi_escape 0x16,29,1,0x6d\n");
+
+ /* Save the return address in $18. The stub's caller knows
+ that $18 might be clobbered, even though $18 is usually
+ a call-saved register.
+
+ Do it early on in case the last move to a floating-point
+ register can be scheduled into the delay slot of the
+ call we are about to make. */
+ fprintf (asm_out_file, "\tmove\t%s,%s\n",
+ reg_names[GP_REG_FIRST + 18],
+ reg_names[RETURN_ADDR_REGNUM]);
}
else
{
@@ -6966,11 +6977,7 @@ mips16_build_call_stub (rtx retval, rtx
if (fp_ret_p)
{
- /* Save the return address in $18 and call the non-MIPS16 function.
- The stub's caller knows that $18 might be clobbered, even though
- $18 is usually a call-saved register. */
- fprintf (asm_out_file, "\tmove\t%s,%s\n",
- reg_names[GP_REG_FIRST + 18], reg_names[RETURN_ADDR_REGNUM]);
+ /* Now call the non-MIPS16 function. */
output_asm_insn (MIPS_CALL ("jal", &fn, 0, -1), &fn);
fprintf (asm_out_file, "\t.cfi_register 31,18\n");
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs
2014-11-19 13:30 [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs Maciej W. Rozycki
@ 2014-11-21 19:14 ` Moore, Catherine
2014-12-01 12:57 ` Maciej W. Rozycki
2014-12-30 0:25 ` [PATCH] MIPS16/GCC: Optimise `__call_stub_' call stubs Maciej W. Rozycki
1 sibling, 1 reply; 4+ messages in thread
From: Moore, Catherine @ 2014-11-21 19:14 UTC (permalink / raw)
To: Rozycki, Maciej, gcc-patches; +Cc: Eric Christopher, Matthew Fortune
> -----Original Message-----
> From: Rozycki, Maciej
> Sent: Wednesday, November 19, 2014 8:05 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Moore, Catherine; Eric Christopher; Matthew Fortune
> Subject: [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs
>
>
> 2014-11-19 Maciej W. Rozycki <macro@codesourcery.com>
>
> gcc/
> * config/mips/mips.c (mips16_build_call_stub): Move the save of
> the return address in $18 ahead of passing arguments to FPRs.
>
> Maciej
This looks OK. Please commit.
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs
2014-11-21 19:14 ` Moore, Catherine
@ 2014-12-01 12:57 ` Maciej W. Rozycki
0 siblings, 0 replies; 4+ messages in thread
From: Maciej W. Rozycki @ 2014-12-01 12:57 UTC (permalink / raw)
To: Moore, Catherine; +Cc: gcc-patches, Eric Christopher, Matthew Fortune
On Fri, 21 Nov 2014, Moore, Catherine wrote:
> > gcc/
> > * config/mips/mips.c (mips16_build_call_stub): Move the save of
> > the return address in $18 ahead of passing arguments to FPRs.
> >
> > Maciej
>
> This looks OK. Please commit.
Applied, thanks for your review.
Maciej
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH] MIPS16/GCC: Optimise `__call_stub_' call stubs
2014-11-19 13:30 [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs Maciej W. Rozycki
2014-11-21 19:14 ` Moore, Catherine
@ 2014-12-30 0:25 ` Maciej W. Rozycki
1 sibling, 0 replies; 4+ messages in thread
From: Maciej W. Rozycki @ 2014-12-30 0:25 UTC (permalink / raw)
To: gcc-patches; +Cc: Catherine Moore, Eric Christopher, Matthew Fortune
On Wed, 19 Nov 2014, Maciej W. Rozycki wrote:
> I have a second optimisation to make here too, but that triggers a
> surprising bug in GNU LD where BFD code meant to discard unused stubs
> appears not to work at all. So that'll have to be fixed first and it
> also means the other optimisation is unsafe to include in 5.0. I plan
> to post it shortly anyway for discussion, once I have the linker bug
> fixed.
For posterity -- optimise plain call `__call_stub_' MIPS16 stubs (where
no FP value is returned) that just jump to (tail-call) the actual standard
MIPS function. There is no need to jump via a register here as we know
that:
1. By definition the jump target is going to be standard MIPS code (no
need to relax to JALX ever).
2. We are not linked into PIC code as PIC code uses libgcc.a's indirect
stubs instead so $25 doesn't have to be valid on function's entry.
3. If the target function has been compiled to PIC code, then a PIC stub
will be prepended to the function by LD to load $25 on entry as usually
with non-PIC code.
This shortens the stub from code like:
Disassembly of section .mips16.call.callee_af7:
00000000 <__call_stub_callee_af7>:
0: 3c190000 lui t9,0x0
0: R_MIPS_HI16 callee_af7
4: 27390000 addiu t9,t9,0
4: R_MIPS_LO16 callee_af7
8: 44846000 mtc1 a0,$f12
c: 44877000 mtc1 a3,$f14
10: 44867800 mtc1 a2,$f15
14: 03200008 jr t9
18: 00000000 nop
(taken from the `pr23324' test case at `-O0' which also implies `-O1' for
GAS, i.e. no branch swapping and hence the delay-slot NOP) to code like:
Disassembly of section .mips16.call.callee_af7:
00000000 <__call_stub_callee_af7>:
0: 44846000 mtc1 a0,$f12
4: 44877000 mtc1 a3,$f14
8: 44867800 mtc1 a2,$f15
c: 08000000 j 0 <__call_stub_callee_af7>
c: R_MIPS_26 callee_af7
10: 00000000 nop
and also helps branch prediction (instruction prefetching at the target of
the jump) where available by avoiding an indirect jump.
As noted in the previous message cited above this however triggers a BFD
bug, which I tracked down to a missing feature: call stubs are meant to be
discarded where not needed -- which is where the actual function called is
MIPS16 code -- but that has never been implemented where the actual
function called is local (symbol referred binds locally). This is because
the global symbol hash is used internally by MIPS BFD linker code to check
which MIPS16 stubs have to stay and which ought to be discarded.
Consequently any such stubs associated with local symbols are left
untouched and get through to linker output, wasting storage and runtime
memory space too. When the actual function is MIPS16 code, the linker
then fails as it cannot relax the jump associated with the R_MIPS_26
relocation and bails out:
mips-linux-gnu-ld: pr23324.o: .mips16.call.callee_af7+0xc: Unsupported jump between ISA modes; consider recompiling with interlinking enabled.
mips-linux-gnu-ld: final link failed: Bad value
even though the stub will obviously never execute. With an indirect jump
currently produced the useless stub makes its way to linker output
successfully with the mode switch taken into account in the HI16/LO16
relocations associated with the LUI/ADDIU instruction pair.
Therefore the BFD issue needs to be fixed first before this optimisation
can be made and right now I cannot dive into implementing the missing bit
noted above, so I'm just sharing this change so that it can be used in the
future when BFD has been corrected.
2014-12-29 Maciej W. Rozycki <macro@codesourcery.com>
gcc/
* config/mips/mips.c (mips16_build_call_stub): Emit a direct
jump (and omit the address load) rather than a jump-register
instruction in the tail-call case.
Maciej
gcc-mips16-call-stub-j.patch
Index: gcc-fsf-trunk-quilt/gcc/config/mips/mips.c
===================================================================
--- gcc-fsf-trunk-quilt.orig/gcc/config/mips/mips.c 2014-11-18 23:33:10.917768628 +0000
+++ gcc-fsf-trunk-quilt/gcc/config/mips/mips.c 2014-11-18 23:33:32.417976370 +0000
@@ -6957,19 +6957,6 @@ mips16_build_call_stub (rtx retval, rtx
reg_names[GP_REG_FIRST + 18],
reg_names[RETURN_ADDR_REGNUM]);
}
- else
- {
- /* Load the address of the MIPS16 function into $25. Do this
- first so that targets with coprocessor interlocks can use
- an MFC1 to fill the delay slot. */
- if (TARGET_EXPLICIT_RELOCS)
- {
- output_asm_insn ("lui\t%^,%%hi(%0)", &fn);
- output_asm_insn ("addiu\t%^,%^,%%lo(%0)", &fn);
- }
- else
- output_asm_insn ("la\t%^,%0", &fn);
- }
/* Move the arguments from general registers to floating-point
registers. */
@@ -7037,10 +7024,7 @@ mips16_build_call_stub (rtx retval, rtx
fprintf (asm_out_file, "\t.cfi_endproc\n");
}
else
- {
- /* Jump to the previously-loaded address. */
- output_asm_insn ("jr\t%^", NULL);
- }
+ output_asm_insn (MIPS_CALL ("j", &fn, 0, -1), &fn);
#ifdef ASM_DECLARE_FUNCTION_SIZE
ASM_DECLARE_FUNCTION_SIZE (asm_out_file, stubname, stubdecl);
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-12-29 23:38 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-19 13:30 [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs Maciej W. Rozycki
2014-11-21 19:14 ` Moore, Catherine
2014-12-01 12:57 ` Maciej W. Rozycki
2014-12-30 0:25 ` [PATCH] MIPS16/GCC: Optimise `__call_stub_' call stubs Maciej W. Rozycki
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).